## Figures

## Abstract

A common problem that is encountered in medical applications is the overall homogeneity of survival distributions when two survival curves cross each other. A survey demonstrated that under this condition, which was an obvious violation of the assumption of proportional hazard rates, the log-rank test was still used in 70% of studies. Several statistical methods have been proposed to solve this problem. However, in many applications, it is difficult to specify the types of survival differences and choose an appropriate method prior to analysis. Thus, we conducted an extensive series of Monte Carlo simulations to investigate the power and type I error rate of these procedures under various patterns of crossing survival curves with different censoring rates and distribution parameters. Our objective was to evaluate the strengths and weaknesses of tests in different situations and for various censoring rates and to recommend an appropriate test that will not fail for a wide range of applications. Simulation studies demonstrated that adaptive Neyman’s smooth tests and the two-stage procedure offer higher power and greater stability than other methods when the survival distributions cross at early, middle or late times. Even for proportional hazards, both methods maintain acceptable power compared with the log-rank test. In terms of the type I error rate, Renyi and Cramér—von Mises tests are relatively conservative, whereas the statistics of the Lin-Xu test exhibit apparent inflation as the censoring rate increases. Other tests produce results close to the nominal 0.05 level. In conclusion, adaptive Neyman’s smooth tests and the two-stage procedure are found to be the most stable and feasible approaches for a variety of situations and censoring rates. Therefore, they are applicable to a wider spectrum of alternatives compared with other tests.

**Citation: **Li H, Han D, Hou Y, Chen H, Chen Z (2015) Statistical Inference Methods for Two Crossing Survival Curves: A Comparison of Methods. PLoS ONE 10(1):
e0116774.
https://doi.org/10.1371/journal.pone.0116774

**Academic Editor: **Zhongxue Chen,
Indiana University Bloomington, UNITED STATES

**Received: **September 20, 2014; **Accepted: **December 14, 2014; **Published: ** January 23, 2015

**Copyright: ** © 2015 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

**Data Availability: **Data are available from the book (Klein JP, Moeschberger ML. (2003) Survival analysis techniques for censored and truncated data. 2th ed. New York: Springer) and SEER (The Surveillance, Epidemiology, End Results Program).

**Funding: **This work was supported by the National Natural Science Foundation of China (81202288). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

In clinical studies, the task of comparing the overall equality of two survival distributions with censored observations is a key element in survival analysis. It is well known that the commonly used log-rank test has optimum power under the assumption of proportional hazard rates. However, this assumption is often violated, especially when two survival curves cross each other. It has been demonstrated that the log-rank test may lose power in this situation [1–5]. The phenomenon of crossing survival curves is common [6–15] when a treatment may offer a short-term benefit but does not provide long-term advantages. In a study conducted by Mok [12], the log-rank test was used to evaluate the effects of two treatments even when the two survival curves crossed. However, Seruga et al. [16] and Bouliotis et al. [17] have observed that if the curves cross, there is a clear violation of the proportional-hazards assumption, and they have suggested the use of other statistical methods that do not rely on these assumptions. Kristiansen [18] has reviewed all publications in five notable journals and identified 175 studies concerning survival analysis, among which crossing survival curves were present in 47% of the studies. Of those studies in which crossing survival curves were present, the log-rank test was performed in 70% of the tests, and only 31% of them reported testing for proportional hazards. Using the log-rank test under conditions of non-proportional hazards may lead to misleading results, therefore, Bouliotis et al. [17] have stated that “*there is a need in the clinical community to clarify methods that are appropriate when survival curves cross*.”

In the literature, there are a fair number of statistical methodologies for addressing the crossing-curves problem. Gill [19] has proposed a supremum version of the weighted log-rank test, the Renyi test. Fleming et al. [20] have developed the modified Kolmogorov—Smirnov test for the correct comparison of crossing survival curves. Koziol [21] and Schumacher [22] have generalized the Cramér—von Mises test to censored data. The above methods are based on the integrated difference of the Nelson—Aalen cumulative hazard functions.

Another category of methods is the weighted Kaplan—Meier tests based on the integrated weighted difference of two Kaplan—Meier estimators. Pepe and Fleming [23] have demonstrated that weighted Kaplan—Meier tests perform better than the log-rank test in the crossing hazards scenario. To address the problem of choosing a suitable weight function in advance, some versatile tests have been developed based on the maximum version of the weighted Kaplan—Meier test [24] or the combination of the Fleming—Harrington test [25]. Lin and Wang [4] have proposed a test statistic that measures the squared differences at each time point. Lin and Xu [1] have designed a specialized method to compare two crossing survival curves based on the absolute difference between the area under the two curves. Qiu and Sheng [2] have suggested a two-stage procedure in which the log-rank test serves as the first stage and a proposed procedure for addressing the crossing hazard rates is applied in the second stage. Kraus [26] has constructed a class of Neyman’s smooth tests based on the concept of Neyman’s embedding and a data-driven strategy.

For most of the procedures mentioned above, the performances of the corresponding test statistics have been evaluated by considering several survival scenarios. Simulation studies have demonstrated that the tests developed by Lin and Wang [4] and by Lin and Xu [1] perform better than do the log-rank and Wilcoxon tests in the case of crossing survival curves. Tubert-Bitter et al. [27] have considered early, middle and late crossings of the survival curves. Moreover, Liu et al. [3] have conducted a comprehensive simulation study of three different patterns of crossing hazard rates and have demonstrated that some of weighted log-rank tests (log-rank, Gehan—Wilcoxon and Peto—Peto) lose power compared with methods that are specifically designed to address the problem of crossing hazard rates.

As mentioned above, few simulation studies have accounted for the scenario in which two survival curves cross at some particular location, which is a focal point to be discussed in our study. Most studies have considered only a few of the new testing approaches and have not evaluated the statistical properties of these new approaches under high censoring rates. Therefore, we performed Monte Carlo simulations to investigate whether the power of these methods was robust under various configurations of crossing survival curves and to determine the behavior of these methods with variations in censoring rates. Our objective is to evaluate the strengths and weaknesses of the tests for a variety of situations and censoring rates and to recommend an appropriate test that will not fail for a wide range of applications. The article is structured as follows: In the following section, we briefly introduce several existing methods. The statistical properties of these methods in a variety of situations are presented in Section 3. All methods are illustrated using three examples in Section 4. Finally, in Section 5, we summarize our findings and present a general discussion.

## Methods

### Assumptions and notations

Our study focuses on the comparison of methods for testing the omnibus hypothesis:
for two survival functions *S*_{1}(*t*) and *S _{2}* (

*t*) that cross each other. If two survival functions cross each other, then the corresponding hazard rates also cross each other [3].

Consider two censored samples of sample size *N*_{i} (*i* = 1, 2), and let *t*_{1} < … < *t _{j}* < … <

*t*(

_{τ}*j*= 1, 2, …,

*τ*) be the distinct failure times in the pooled sample, where

*t*is the latest event time at which two groups contain at least one subject at risk. The Kaplan—Meier estimator is defined as for

_{τ}*t*<

*t*

_{1}and for

*t*≥

*t*

_{1}, where

*d*and

_{j}*n*denote the number of observed failures and the number of individuals who are at risk at time

_{j}*t*, respectively.

_{j}### Existing methods

We first consider methods constructed based on the difference between the Nelson—Aalen cumulative hazard functions. It is well known that when different weight functions are selected, the weighted log-rank test can generate a variety of methods, such as the log-rank (LR) test, the Gehan—Wilcoxon [28] (GW) test, the Tarone-Ware [29] (TW) test and a general class of Fleming-Harrington [30–31] (*G ^{ργ}*) tests with the weight function . Even so, when crossing survival curves appear, early positive differences between the two hazard rates are canceled out by later differences of the opposite sign, which may cause some weighted log-rank tests to lose power [3]. As a result, considerable efforts have been directed toward improving the sensitivity of testing approaches in such cases. Instead of taking the sum over all time points, Gill’s [19] Renyi (RY) test improves the testing power by replacing the numerator of the weighted log-rank statistic with a supremum. Fleming et al. [20] have also modified the classical Kolmogorov—Smirnov test and generalized it for use with arbitrarily right-censored data. Analogous to the Renyi test, the modified Kolmogorov—Smirnov statistic (MKS) is also a supremum approach, but this test is more sensitive to apparent differences in survival distributions at a single point in time. Moreover, Schumacher [22] has generalized the classical Cramér—von Mises test for application to right-censored data, modifying the test to be based on the integrated squared difference between two estimated hazard rates [31]. There are two types of Cramér—von Mises statistics: one is asymptotic to a standard Brownian motion process (CVM1), and the other converges to a Brownian bridge process (CVM2) [32].

In another class of methodologies, which includes those that are constructed in terms of the difference between two estimated survival functions, Pepe and Fleming [23] have introduced a class of statistics that constitute the Weighted Kaplan—Meier test (WKM), which is based on the integrated weighted difference between Kaplan—Meier estimators. Similar to the weighted log-rank test, a prior misspecification of the weight function may decrease the power of the WKM test. Of greater concern is the fact that a series of trials using different test statistics may lead to a multiple comparison problem and to the inflation of the overall type I error rate. These shortcomings of the WKM test motivated Shen and Cai [24] to propose a maximum WKM test (MKM) using a set of weighted Kaplan—Meier statistics {*V*_{1}, *V*_{2}, … *V*_{m}}, where the weight functions are generated from all possible combinations of *ρ* = 0, 1, 2 and *γ* = 0, 1, 2 in the function . Meanwhile, with appropriate justification, the type I error rate is well controlled in a multivariate Gaussian distribution.

In recent years, many scholars have proposed new approaches for detection in a wide range of circumstances. Lee [25] has developed three versatile tests based on the combination of two Fleming—Harrington statistics using *ρ* = 1, *γ* = 0 and *ρ* = 0, *γ* = 1. These combined tests are based on (1) the absolute value of the average of the two statistics (SHL1), (2) the average of their absolute values (SHL2) and (3) the maximum of their absolute values (SHL3). Using the method proposed by Yang et al. [33], the critical values and *P* values of these latter two versatile tests can be obtained. Lin and Wang [4] (LW) have proposed a new approach to comparing the overall homogeneity of survival curves by measuring the squared differences between the number of observed failures and the number of expected failures over time. Lin and Xu [1] have suggested a new method for comparing survival distributions based on the absolute difference between the area under two survival curves; here, the one-sided and two-sided test are denoted by LX1 and LX2, respectively. With the intent of addressing all possible alternatives, Qiu and Sheng [2] have proposed a two-stage procedure (TS). In stage one, a weighted log-rank test, such as the log-rank test, is conducted. If a rejection of the null hypotheses is obtained in stage one, then the entire procedure is terminated; otherwise, the result suggests that either the two hazard rates may be identical or they may cross each other, but these two scenarios cannot be distinguished by the stage-one procedure. Thus, the method proceeds to stage two, in which the basic idea is to choose a particular weight for the weighted log-rank test that changes signs before and after a potential crossing point. Because the two test statistics in two individual stages are proved to be asymptotically independent of each other under *H*_{0}, then the p-value of the entire test can be redefined. Kraus’s test [26] is based on the concept of Neyman’s embedding combined with Schwarz’s selection rule. The null hypothesis *h*_{1}(*t*) = *h*_{2}(*t*) is tested against a model in which the hazard ratio of the two survival distributions is expressed by *d* smooth functions. By constructing the partial likelihood score statistic *T _{d}*,

*d*fixed-dimensional smooth tests are defined. However, one would not know how to specify the number of smooth functions for the test prior to analysis. Therefore, the author introduced Schwarz’s selection rule and proposed a more reasonable data-driven test statistic,

*T*

_{s}. When the number of high priority basis functions

*d*

_{0}is specified, the data-driven test statistics of nested subsets or of all subsets are asymptotically

*χ*

^{2}distributed with different degrees of freedom. The author has demonstrated that in practice, a test with a relatively small fixed dimension

*d*or with nested subsets of smaller

*d*

_{0}often performs very well. See Kraus [26] for more details. Based on the results of the simulation studies cited by Kraus, we selected a fixed-dimensional test statistic

*T*(

_{d}*d*= 4) (NY1) and a data-driven test statistic

*T*(

_{s}^{nested}*d*= 8,

*d*

_{0}= 0) (NY2) for further simulations because these statistics exhibited the greatest stability of discrimination power among all evaluated parameters. For consistency with the original text, both tests were performed using 2000 permutations. A brief summary of the methods investigated in this study is listed in S1 Table.

## Simulation

### Simulation design

To assess the performance of the tests mentioned above, we conducted Monte Carlo simulations for various random censoring rates (0%, 20%, 40% and 60%) and the following situations: (A) two groups with proportional hazard rates, (B) two crossing survival curves with the crossing point located at *S*(*t*)>0.6, (C) two survival curves crossing at *S*(*t*) = 0.4~0.6 and (D) two survival curves crossing at *S*(*t*) = 0.2~0.4. Meanwhile, we considered equal sample sizes (*N*_{1} = *N*_{2} = 20, 50, 100) and unequal sample sizes (*N*_{1} = 20, *N*_{2} = 30; *N*_{1} = 40, *N*_{2} = 80; *N*_{1} = 20, *N*_{2} = 50; *N*_{1} = 50, *N*_{2} = 100) in each group. For each sample size, 5000 iterations were performed for each situation and censoring rate. The exact power and size of these test statistics were estimated by determining the proportion of samples for which the null hypothesis was rejected at the *α* = 0.05 significance level.

A simulation of time-to-event data was performed based on a randomly censored model [34–35]. We generate individual failure times *X* following the survival functions (A)-(D) in Fig. 1; (B)-(D) are defined to be approximately equal to the real data described in Example and (A) a proportional hazards model for reference. Next, the censoring time *Cr* in two samples was generated from uniform distributions *U* (0, *a*) and *U* (0, *b*), where varying the values of *a* and *b* may result in censoring rates of approximately 20%, 40% or 60% in the two samples. Each individual was assigned an observed survival time *T* = min(*X*, *Cr*) and an event indicator *δ* = I[*X*≤*Cr*]. Because the lifetime *X* followed different distributions in each group, it was necessary for the values of *a* and *b* to be unequal to keep the average censoring rates in each group approximately equal to the given censoring rates.

* *λ*_{i} denotes the hazard function of *S*_{i}(*t*), a piecewise exponential distribution. † W(*η*, *θ*) denotes a Weibull distribution having *S*(*t*) = exp{(-*t* / *θ*)^{η}}, where *η*>0 is the shape parameter, *θ*>0 is the scale parameter.

### Estimation of statistical power and type I error

**Situation 1**. When two survival curves have proportional hazard functions (see Fig. 1A), the powers of various test statistics rise with increasing sample size. By contrast, for a particular sample size, the powers of the tests decline with increasing censoring rate (see Table 1). As expected, the LR test demonstrates the optimal power in this situation, followed by the SHL1, SHL2 and SHL3 tests. When each sample size is equal to 20 and when the censoring rate is increased to 60%, the power of each test becomes less than 40%, a relatively low level. However, when each sample size is greater than or equal to 50 while the censoring rate is simultaneously no more than 20%, all tests demonstrate powers above 80%. In general, the LR, SHL1, SHL2 and SHL3 tests are far superior to the other tests under the assumption of proportional hazard rates.

**Situation 2**. For the early (*S*(*t*)>0.6) crossing of the survival curves (Fig. 1B), Table 2 summarizes the results of the power estimations. When the sample size of each group is not less than 50, we note that the powers of all tests gradually decrease with increasing censoring rate. By contrast, for small sample sizes (*N*_{1} = *N*_{2} = 20), the powers of most tests (except GW, CVM1 and WKM) increase unexpectedly when the censoring rate reaches 60%. High censoring rates appears to have considerable influence on the power of the G10, TW, GW and WKM tests. When the censoring rate is increased to 60%, these four statistics suffer significant loss of power (decreasing to 8.14%, 7.66%, 3.26% and 2.24%, respectively). The weight function of the WKM test is associated with the Kaplan—Meier estimator , which can be very unstable in the presence of heavy censoring. To remedy this instability, the statistic down-weights the contribution of if the censoring is heavy, and as a result, the WKM test loses power in such a situation. In contrast, the TW, GW and G10 tests assign a greater weight to earlier failure times, causing these tests to be insensitive to differences at later times. Overall, the G01, G11, SHL3, LW, LX, TS, NY1 and NY2 tests perform better than do the others. Among them, the NY1 and NY2 tests are the most stable, and except for small (*N*_{1} = *N*_{2} = 20) sample sizes, the powers of both statistics exceed 80% for various combinations of sample sizes and censoring rates.

**Situation 3**. When considering middle (*S*(*t*) = 0.4~0.6) crossing of the survival curves (Fig. 1C), the NY2, NY1, G01 and SHL3 tests, followed by the TS, G11, MKS, SHL2, LW and LX tests, exhibit much better performances than the remaining tests for low and moderate censoring rates. However, as Table 3 demonstrates, when the censoring rates are gradually increased to 60%, the powers of the G01, G11, SHL2, SHL1 and LW tests decrease sharply to 4.58%, 5.14%, 6.70%, 6.70% and 2.60%, respectively. This result implies that these five tests are more likely to be affected by high censoring rates. Meanwhile, tests that are not designed to address the crossing-hazard problem, such as the LR, GW and TW tests, maintain a relatively low power in most cases. After a comprehensive comparison of these tests, it is apparent that NY1, NY2, TS, MKS and SHL3 are most applicable for scenarios with middle crossings of survival curves and with high censoring rates. When each sample size is increased 100, these four tests (NY1, NY2, TS and MKS) maintain powers of greater than 80% for censoring rates between 0% and 60%. When a small sample size (*N*_{1} = *N*_{2} = 20) is considered, NY1 and NY2 perform slightly better at various censoring rates.

**Situation 4**. For late (*S*(*t*) = 0.2~0.4) crossing of the survival curves (Fig. 1D), the results are presented in Table 4. For increasing censoring rates, the MKS, TS, NY1 and NY2 tests exhibit gradually decreasing power, whereas the powers of the LR, G10, GW, TW, RY, CVM1, WKM and MKM tests increase, and the other tests exhibit various degrees of fluctuation. However, for small (*N*_{1} = *N*_{2} = 20) and unequal (*N*_{1} = 20, *N*_{2} = 50) sample sizes and for high censoring rates, the NY2, NY1 and TS tests are the most powerful tests, followed by the MKS, SHL3, CVM2, SHL2, MKM, CVM1 and RY tests. When each sample size is equal to 100, the tests listed above demonstrate powers greater than 98% regardless of the censoring rate. Although the LW and LX tests are both specialized methods for detecting differences when the survival curves cross, they exhibit lower power for small (*N*_{1} = *N*_{2} = 20) and unequal (*N*_{1} = 20, *N*_{2} = 30; *N*_{1} = 20, *N*_{2} = 50) sample sizes. In addition, it is evident that the LW test is more likely to be affected by high censoring rates and unbalanced design. Therefore, the NY2, NY1 and TS tests are most robust for late crossings of the survival curves.

**Situation 5**. To investigate type I errors, two samples were independently generated from an exponential distribution with a hazard rate of *λ*_{1} = *λ*_{2} = 0.25. Various censoring rates were also considered by setting various uniform distribution parameters. The estimated type I error rates are presented in Table 5. For all combinations of sample sizes and censoring rates, the type I error rates for G01, SHL1, SHL2 and SHL3 are slightly inflated, whereas the RY test is more conservative. Although the CVM1 and CVM2 tests are relatively conservative for small sample sizes (*N*_{1} = *N*_{2} = 20), both statistics gradually approach the nominal level of 0.05 as the sample sizes increase. Note that the type I error rates of the LX1 (one-sided) and LX2 (two-sided) tests appear to increase with increasing censoring rates, increasing from 0.0328 to 0.0868 and from 0.0194 to 0.0606, respectively. When the censoring rate reaches 60%, the type I error rate of the LX1 test exceeds the nominal level for all combinations of sample sizes. In addition to the above tests, all statistics are quite close to the nominal level of 0.05.

## Applications

In this section, we apply the above testing procedures to three real data examples that correspond to Situations 2–4. The first example consists of data from kidney dialysis patients, which are described in detail by Klein and Moeschberge [31]. The second consists of data regarding breast cancer patients obtained from the Surveillance, Epidemiology, and End Results (SEER) database. The last example concerns the comparison of two types of treatments for gastric cancer [36].

### Example 1 Kidney dialysis data

The kidney-dialysis trial was designed to assess the time to the first exit-site infection (in months) in patients with renal insufficiency. In 43 patients, the catheter was surgically implanted, whereas in 76 patients, the catheter was percutaneously placed. Catheter failure was the primary reason for censoring. There were 27 patients censored in the first group and 65 in the second, corresponding to censoring rates of 62.79% and 85.53%, respectively. As is apparent in Fig. 2A, the two survival curves cross at an early time, and there is a clear difference at later times. Moreover, a test of the proportional-hazards assumption [37] indicated that the assumption is indeed violated in this case (*χ*^{2} = 8.696, *P* = 0.003). The relevant statistics from various tests and the corresponding *P*-values are listed in Table 6. The conventional LR, G10, GW and TW tests offer relatively low power in this situation because the positive differences are cancelled out by the negative differences, leaving them unable to detect the overall differences. As Table 2 indicates, the RY, CVM1, CVM2, WKM and MKM tests are unstable in the case of heavy censoring. As a result, the above tests fail to identify significant differences between the two groups. However, the G01, SHL1, SHL2, SHL3, MKS, LW, LX, TS, NY1 and NY2 tests yield significant results, and this finding is consistent with the simulation results.

### Example 2 Breast cancer data from SEER

The Surveillance, Epidemiology, End Results (SEER) Program collects data concerning cancer cases in various locations and from sources throughout the United States. We extracted the data regarding 144 black patients who were first diagnosed with breast cancer after the year 2000. Among these patients, we focus only on the comparison between the following two groups: Group 1 consists of 35 patients who underwent radiation therapy as their first course of treatment, and group 2 consists of 109 patients whose guardians refused radiation therapy. The censoring rates of the two groups were 51.43% and 57.80%, respectively. As Fig. 2B illustrates, the two survival curves cross each other near *S*(*t*)=0.6. Meanwhile, a test of the proportional-hazards assumption yielded a result of *χ*^{2} = 9.470, *P* =0.002. From Fig. 2B, it is evident that radiation therapy could improve a patient’s survival rate if administered in the early period; however, it could not provide long-term benefits, as patients in group 2 survived longer than did those in group 1 after 50 months. The median survival time of each group was 50 months and 79 months, respectively. As shown in Table 6, the G01, CVM2, LX1, TS, NY1 and NY2 tests conclude that the treatment effects in the two groups are significantly different. Because Fig. 2B is analogous to the configurations observed for Situation 3, the result is also comparable.

### Example 3 Gastric cancer data

Stablein and Koutrouvelis [36] have compared two types of therapies (chemotherapy versus chemotherapy combined with radiotherapy) in the treatment of locally unresectable gastric cancer. In this study, 45 patients were each randomly allocated to one of two groups, with censoring rates of 4.44% and 13.33%, respectively. Fig. 2C presents the two survival functions, which cross at approximately *S*(*t*) = 0.2, and the corresponding test indicated a severe violation of the proportional-hazards assumption (*χ*^{2} = 13.127, *P* < 0.001). As observed in Table 6, the G10 and GW tests are most sensitive to early differences. Moreover, the MKS, CVM1, CVM2, MKM, TS, NY1 and NY2 tests, which demonstrated powers of over 80% in the corresponding simulation scenario, also yielded significant results, as expected. During the initial 1000 days, the use of chemotherapy resulted in a superior survival rate, whereas chemotherapy combined with radiation therapy was associated with an increased number of early deaths, which may be attributable to the progression of tumors within the radiation field or to nutritional and hematological complications. However, chemotherapy combined with radiation therapy appeared to offer better prospects for long-term survival during the late follow-up period.

## Discussion

In clinical research, the violation of the proportional-hazards assumption is often encountered, particularly when two survival curves cross each other. In this paper, we considered a number of tests for the comparison of two survival distributions and investigated their performances for various sample sizes and censoring rates. The objective of our study was to choose a statistical inference method that is appropriate for use when survival curves cross. Because survival functions are more common and intuitive than hazard functions when investigating the survival differences between two groups, simulations were performed for situations in which the survival distributions crossed at early, middle and late times. The simulation results demonstrated that the NY1, NY2 and TS tests had a robust power under various situations of crossing survival curves with reasonable type I error rates. The study confirmed that the location of the crossing point may affect the discrimination power of the test statistic employed. Pepe and Fleming [23] have claimed that the WKM test performs much better than does the LR test in the case of crossing hazards; however, the simulation results confirmed this conclusion only for the late crossing of the survival curves (Situation 4). Moreover, the simulation results also suggested that when the two survival curves cross at an early time (Situation 2) and when the censoring rate is low (≤20%), the LR test maintains a reasonable power of greater than 80%, which is consistent with the results of Tubert-Bitter [27].

A simulation of time-to-event data was performed using a randomly censored model. To ensure that the average censoring rates in each tested group were consistent with the specified censoring rates, it was necessary for the uniform parameters of the two groups to be unequal. In Situaion 4, as the censoring rate was increased from 20% to 40%, the powers of several weighted log-rank tests (LR, G10, GW and TW) improved, as observed by Tubert-Bitter [27]. Such an increase in the censoring rate may lead to a shortening of the maximum length of the follow-up, which reduces the cancellation of positive differences by negative ones, thereby causing the powers of the four conventional tests to unexpectedly increase. However, for the methods specifically designed to address crossing hazards, such as the MKS, TS, NY1 and NY2 tests, their powers gradually decrease as the censoring rate increases. When the censoring rate reaches 60%, the maximum follow-up length (*t* = 2.5) is below the theoretical crossing point (*t* = 5.482). In this particular situation, late crossings cannot be observed for some samples. In a test of the proportional-hazards assumption [37], we found that in 5000 iterations, only 551 samples (11.02%) were non-proportional, whereas 88.98% of the samples satisfied the proportional-hazards assumption. This significant difference in statistics might explain the sudden increase in power of some tests when the censoring rate reached 60%. Moreover, for higher censoring rates, the estimated rate of type I errors for the LX test is quite unstable. Lin and Xu [1] have suggested that the value of the correlation coefficient *ρ _{j, j’}* might influence the type I error rate; therefore, these authors set the correlation coefficient to 0.5 to maintain a reasonable type I error rate and a high power. However, as is demonstrated by Tables 1–5 and by the sensitivity analysis presented in S1 Text, better performance can be achieved if the value of

*ρ*is varied based on the specific conditions of the problem.

_{j, j’}When comparing the limitations and strengths of the various methods, we first find that the performances of the WKM and weighted log-rank tests (LR, G01, G10, G11, GW and TW) are subject to the choice of weight functions. A misspecification of the weights can lead to a substantial loss of power. Moreover, the creation of multiple comparison problems and the inflation of the overall type I error rate might cause difficulties for the user. Second, some modified tests such as the SHL1, SHL2 and SHL3 tests are more likely to be influenced by a high censoring rate. Third, the LW test exhibits lower power in the cases of heavy censoring rates and unbalanced design, and the estimated type I error rate of the LX test often deviates from the nominal level of 0.05. Both methods require much future research. By contrast, as evidenced by the performances of all tests in various scenarios, the NY1, NY2 and TS tests exhibit higher power and greater stability than the other methods even for heavy censoring rates, and the type I error rates for these tests are also close to the nominal level. Because the TS test is not subject to weight functions, it is a particularly convenient method for addressing all possible alternatives. In conclusion, for the testing of the equality of two survival distributions, the NY1, NY2 and TS tests are the most appropriate and robust of all tests evaluated here, especially in the case of two crossing survival curves.

All testing approaches mentioned above are intended for overall hypothesis testing. However, sometimes the greatest interest is focused on which treatment can provide the better short- or long-term survival rate, even when the survival curves cross. Klein et al. [38] have reviewed a number of methods for comparing two survival curves at a fixed point in time. Moreover, Logan et al. [39] have considered several methods for the long-term survival comparison after a pre-specified time point.

For non-proportional hazards problem, Liu et al. [3] have highlighted the relationship between the crossing of survival functions and the crossing of hazard rates: “*if two continuous survival functions cross each other once*, *then the corresponding hazard rates must cross each other at least once*. *However*, *it is possible that survival functions do not cross*, *but their hazard rates cross very often*.” Overall, methods based on hazard rates are more sensitive in terms of capturing the differences and contain more information. Therefore, analyzing the survival curves is not sufficient. In clinical practice, it is necessary to explore the crossing time point of two crossing hazard functions at which the new treatment effect may change reversely, in comparison with the standard treatment. Cheng et al. [40] have developed a nonparametric procedure for constructing the confidence intervals for the first crossing point of two hazard functions. In addition to the patients’ survival data, longitudinal data regarding some time-dependent covariates should be considered to assess the effect of treatment. Recently, Park and Qiu [41] have proposed a joint modeling procedure for analyzing both the survival and longitudinal data with crossing hazard rate functions. In summary, depending on the particular conditions of the study, researchers may wish to consider overall hypothesis testing in combination with the various methods for a more reliable and comprehensive statistical analysis of the survival data.

## Supporting Information

### S1 Table. Hypothesis tests for two-sample survival data.

https://doi.org/10.1371/journal.pone.0116774.s001

(PDF)

### S1 Text. Factors affected the power and type I error rate of Lin-Xu test.

The relations among the censoring rate, the correlation coefficient and the power and type I error rate for N1 = N2 = 50 with 5000 iterations for LX1 and LX2 tests. In Fig. 1, it can be observed that the power decreases as the censoring rate and correlation coefficient increase. In Fig. 2, when the censoring rate is specified, the type I error rate decreases as the correlation coefficient increases.

https://doi.org/10.1371/journal.pone.0116774.s002

(PDF)

## Acknowledgments

The authors thank the associate editor and two referees for their valuable comments that greatly improved the quality of this paper.

## Author Contributions

Conceived and designed the experiments: ZC HML. Performed the experiments: HML DH HLC. Analyzed the data: HML ZC. Contributed reagents/materials/analysis tools: ZC. Wrote the paper: HML YWH ZC.

## References

- 1. Lin X, Xu Q (2010) A new method for the comparison of survival distributions. Pharm Stat 9(1): 67–76. pmid:19306313
- 2. Qiu P, Sheng J (2008) A two-stage procedure for comparing hazard rate functions. J R Statist Soc B 70(1): 191–208.
- 3. Liu K, Qiu P, Sheng J (2007) Comparing two crossing hazard rates by Cox proportional hazards modelling. Stat Med 26(2): 375–391. pmid:16538703
- 4. Lin X, Wang H (2004) A new testing approach for comparing the overall homogeneity of survival curves. Biometrical J 46(5): 489–496.
- 5.
Le CT (2003) Statistical methods for the comparison of crossing survival curves. In: Balakrishnan N and Rao CR (eds) Handbook of Statistics, Vol.23. Amsterdam: Elsevier, pp.277–289.
- 6. Weeda VB, Murawski M, McCabe AJ, Maibach R, Brugières L, et al. (2013) Fibrolamellar variant of hepatocellular carcinoma does not have a better survival than conventional hepatocellular carcinoma-Results and treatment recommendations from the Childhood Liver Tumour Strategy Group (SIOPEL) experience. Eur J Cancer 49(12): 2698–2704. pmid:23683550
- 7. Gahrton G, lacobelli S, Björkstrand B, Hegenbart U, Gruber A, et al. (2013) Autologous/reduced-intensity allogeneic stem cell transplantation vs autologous transplantation in multiple myeloma: long-term results of the EBMT-NMAM2000 study. Blood 121(25): 5055–5063. pmid:23482933
- 8. Smith MJF, Ridgway PF, Catton CN, Cannell AJ, O’Sullivan B, et al. (2014) Combined management of retroperitoneal sarcoma with dose intensification radiotherapy and resection: Long-term results of a prospective trial. Radiotherapy and Oncology 110: 165–171. pmid:24411227
- 9. Delarue R, Tilly H, Mounier N, Petrella T, Salles G, et al. (2013) Dose-dense rituximab-CHOP compared with standard rituximab-CHOP in elderly patients with diffuse large B-cell lymphoma (the LNH03–6B study): a randomised phase 3 trial. Lancet Oncol 14(6): 525–533. pmid:23578722
- 10. Samandari T, Agizew TB, Nyirenda S, Tedla Z, Sibanda T, et al. (2011) 6-month versus 36-month isoniazid preventive treatment for tuberculosis in adults with HIV infection in Botswana: a randomised, double-blind, placebo-controlled trial. Lancet 377(9777): 1588–1598. pmid:21492926
- 11. Krishnan A, Pasquini MC, Logan B, Stadtmauer EA, Vesole DH, et al. (2011) Autologous haemopoietic stem-cell transplantation followed by allogeneic or autologous haemopoietic stem-cell transplantation in patients with multiple myeloma (BMT CTN 0102): a phase 3 biological assignment trial. Lancet Oncol 12(13): 1195–1203. pmid:21962393
- 12. Mok TS, Wu Y, Thongprasert S, Yang C, Chu D, et al. (2009) Gefitinib or carboplatin-paclitaxel in pulmonary adenocarcinoma. New Engl J Med 361(10): 947–957. pmid:19692680
- 13. Gulizia M, Mangiameli S, Orazi S, Chiarandà G, Piccione G, et al. (2008) A randomized comparison of amiodarone and class IC antiarrhythmic drugs to treat atrial fibrillation in patients paced for sinus node disease: The Prevention Investigation and Treatment: A Group for Observation and Research on Atrial arrhythmias (PITAGORA) trial. Am Heart J 155(1): 100–101. pmid:18082498
- 14. Chew DP, Amerena JV, Coverdale SG, Rankin JM, Astley CM, et al. (2008) Invasive management and late clinical outcomes in contemporary Australian management of acute coronary syndromes: observations from the ACACIA registry. Med J Aust 188(12): 691–697. pmid:18558890
- 15. Tikidzhieva A, Benner A, Michel S, Formentini A, Link KH, et al. (2012) Microsatellite instability and Beta2-Microglobulin mutations as prognostic markers in colon cancer: results of the FOGT-4 trial. Br J Cancer 106(6): 1239–1245. pmid:22353804
- 16. Ballatori E, Fatigoni S, Roila F, Seruga B, Amir E, et al. (2009) Treatment of lung cancer. New Engl J Med 361(25): 2485–2487. pmid:20018971
- 17. Bouliotis G, Billingham L (2011) Crossing survival curves: alternatives to the log-rank test. Trials 12(Suppl 1): A137.
- 18. Kristiansen IS (2012) PRM39 Survival curve convergences and crossing: a threat to validity of meta-analysis? Value in health 15(7): A652.
- 19.
Gill RD (1980) Censoring and Stochastic Integrals. Mathematical Centre Tracts. Amsterdam: Mathematisch Centrum, 124.
- 20. Fleming TR, O’Fallon JR, O’Brien PC, Harrington DP (1980) Modified Kolmogorov-Smirnov test procedures with application to arbitrarily right-censored data. Biometrics 36(4): 607–625.
- 21. Koziol JA (1978) A two sample cramér-von mises test for randomly censored data. Biometrical J 20(6): 603–608.
- 22. Schumacher M (1984) Two-sample tests of Cramér-von mises and Kolmogorov-Smirnov type for randomly censored data. International Statistical Review 52(3): 263–281.
- 23. Pepe MS, Fleming TR (1989) Weighted Kaplan-Meier statistics: a class of distance tests for censored survival data. Biometrics 45(2): 497–507. pmid:2765634
- 24. Shen Y, Cai J (2001) Maximum of the weighted Kaplan-Meier tests with application to cancer prevention and screening trials. Biometrics 57(3): 837–843. pmid:11550935
- 25. Lee S (2007) On the versatility of the combination of the weighted log-rank statistics. Comput Stat Data Anal 51(12): 6557–6564.
- 26. Kraus D (2009) Adaptive Neyman’s smooth tests of homogeneity of two samples of survival data. J Stat Plan Infer 139(10): 3559–3569.
- 27. Tubert-Bitter P, Kramar A, Chalé J, Moreau T (1994) Linear rank tests for comparing survival in two groups with crossing hazards. Comput Stat Data Anal 18(5): 547–559.
- 28. Gehan EA (1965) A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52(1–2): 203–223. pmid:14341275
- 29. Tarone RE, Ware J (1977) On distribution-free tests for equality of survival distributions. Biometrika 64(1): 156–160.
- 30.
Fleming TR, Harrington DP (2005) Counting processes and survival analysis. Hoboken, New Jersey: John Wiley & Sons, Inc.
- 31.
Klein JP, Moeschberger ML (2003) Survival analysis techniques for censored and truncated data. 2th ed. New York: Springer.
- 32.
Andersen PK, Borgan Ø, Gill RD, Keiding N (1993) Statistical models based on counting processes. New York: Springer-Verlag, pp.393–395.
- 33. Yang S, Hsu L, Zhao L (2005) Combining asymptotically normal tests: case studies in comparison of two groups. J Stat Plan Infer 133(1): 139–158.
- 34. Crowther MJ, Lambert PC (2013) Simulating biologically plausible complex survival data. Stat Med 32(23): 4118–4134. pmid:23613458
- 35. Burton A, Altman DG, Royston P, Holder RL (2006) The design of simulation studies in medical statistics. Stat Med 25: 4279–4292. pmid:16947139
- 36. Stablein DM, Koutrouvelis IA (1985) A two-sample test sensitive to crossing hazards in uncensored and singly censored data. Biometrics 41(3): 643–652. pmid:4074816
- 37. Grambsch PM, Therneau TM (1994) Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 81(3): 515–526.
- 38. Klein JP, Logan B, Harhoff M, Andersen PK (2007) Analyzing survival curves at a fixed point in time. Stat Med 26(24): 4505–4519. pmid:17348080
- 39. Logan BR, Klein JP, Zhang M (2008) Comparing treatments in the presence of crossing survival curves: an application to bone marrow transplantation. Biometrics 64(3): 733–740. pmid:18190619
- 40. Cheng M, Qiu P, Tan X, Tu D (2009) Confidence intervals for the first crossing point of two hazard functions. Lifetime Data Anal 15(4): 441–454. pmid:19882351
- 41. Park KY, Qiu P (2014) Model selection and diagnostics for joint modeling of survival and longitudinal data with crossing hazard rate functions. Stat Med 33(26): 4532–4546. pmid:25043230