Randomization in survival studies: An evaluation method that takes into account selection and chronological bias

The random allocation of patients to treatments is a crucial step in the design and conduct of a randomized controlled trial. For this purpose, a variety of randomization procedures is available. In the case of imperfect blinding, the extent to which a randomization procedure forces balanced group sizes throughout the allocation process affects the predictability of allocations. As a result, some randomization procedures perform superior with respect to selection bias, whereas others are less susceptible to chronological bias. The choice of a suitable randomization procedure therefore depends on the expected risk for selection and chronological bias within the particular study in question. To enable a sound comparison of different randomization procedures, we introduce a model for the combined effect of selection and chronological bias in randomized studies with a survival outcome. We present an evaluation method to quantify the influence of bias on the test decision of the log-rank test in a randomized parallel group trial with a survival outcome. The effect of selection and chronological bias and the dependence on the study setting are illustrated in a sensitivity analysis. We conclude with a case study to showcase the application of our model for comparing different randomization procedures in consideration of the expected type I error probability.


Introduction
One of the main purposes of randomization is to improve comparability between treatment groups by balancing observed and unobserved covariates in expectation [1]. Randomization furthermore helps to mitigate the risk of selection bias and, depending on the randomization procedure, can protect against imbalanced group sizes throughout the allocation process [1]. Despite the many benefits of randomization, there are also some limitations; for a comprehensive discussion, see [2]: One issue that cannot be addressed by randomization is that patients usually enter a clinical trial sequentially and are often treated immediately [1,2]. Consequently, new patients will be enrolled and assigned to therapies, while others have already received treatment [2]. This delay in time entails several potential sources of bias: On the one PLOS ONE | https://doi.org/10.1371/journal.pone.0217946 June 3, 2019 1 / 14 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 hand, the treatment success itself may be affected by unobserved time trends (chronological bias) [3,4]. These may result from, for example, improved treatment performance due to experience gain [5,6], or changes in inclusion or exclusion criteria [4,7]. On the other hand, the sequential enrollment creates the risk for selection bias whenever blinding cannot be fully attained [8,9]. There are many different randomization procedures for randomly assigning patients to treatments; for a detailed overview, see [1]. In general, randomization procedures that protect well against selection bias perform poorly with regard to time trends [10]. This is caused by the opposing behaviors of either forcing balanced groups or maintaining randomness: The more balance is forced by a randomization procedure, the more robust it is with regard to time trends [4] but at the same time the more susceptible it is to selection bias [11][12][13]. Therefore, there cannot be a universally best randomization procedure. Instead, the choice of a suitable randomization procedure always depends on the specific requirements of the study at hand [10]. In the case of two treatment groups and a normally distributed outcome, several models have been proposed to study the isolated effect of selection bias [12,[14][15][16] or the isolated effect of time trends [4]. Another model was introduced to compare more than two treatment groups in the presence of selection bias [17]. Recently, an approach for the combined assessment of chronological and selection bias in the case of a normally distributed outcome was published [10]. For survival outcomes, a single model has been proposed for the case where survival data follow an exponential distribution and an F-test is performed to compare two treatments [9]. This model is limited to the evaluation of selection bias only and does not account for time trends. In addition, the F-test is rarely considered in clinical trials, as it can only be used in very specific situations. Instead, methods that do not impose any distributional assumptions on the survival times are often used in practice [18][19][20]. Therefore, the previous evaluation method can only be applied to a very limited extent to most survival studies.
The objective of this paper is to provide a method for the evaluation and comparison of different randomization procedures that is applicable to most studies with a survival outcome and thus enable researchers to make a scientifically sound choice. We propose a semi-parametric bias model based on the hazard functions to model the simultaneous effect of selection and chronological bias. Based on this model, we derive an approximation formula that describes the impact on the distribution of the log-rank statistic [21,22]. The relevance to consider bias when selecting a randomization procedure and the dependence on the specific scenario is shown in a sensitivity analysis. We conclude with a case study to illustrate the application of our evaluation method to compare different randomization procedures if selection bias and time trend are anticipated. The performance of the different randomization procedures is evaluated by means of the expected type I error probability.

Prerequisites and setting
We consider a randomized, two-arm, parallel group trial where a control (C) and an experimental (E) treatment are compared with regard to a survival outcome. With an intended 1:1 allocation ratio, a total of n patients are enrolled over an accrual period of length A � 0. The maximum duration of follow-up is of length F, with F > A, so that all patients who have not yet had an event until then are regarded as right-censored. We assume that throughout the accrual period, patients enter the trial sequentially according to a uniform distribution. Furthermore, there is a random censoring mechanism that can be modeled by a probability distribution S cen which is independent of the survival distributions. Let d denote the number of observed events, where d � n, with the distinct ordered event times τ 1 < τ 2 < . . . < τ d .
We assume that patients are assigned to treatments according to a randomization sequence z = (z 1 , . . ., z n ) 2 {0, 1} n and treated immediately. Particularly, the ith patient who enters the trial is assigned to treatment z i 2 {0, 1}. If z i = 0, the respective patient is assigned to the control treatment, otherwise to the experimental treatment. The randomization sequence z can be interpreted as the realization of a random variable Z whose probability distribution depends on the randomization procedure used. The available randomization procedures can be classified into (I) unrestricted and (II) restricted randomization procedures. Among the restricted randomization procedures, there are (a) procedures with forced final balance [1] and (b) procedures with a maximum tolerated imbalance (MTI procedures) [23]. Within this paper we consider the following randomization procedures: (I). Complete randomization (CR): Patients are randomized to treatments by tossing a fair coin.
(I). Efron's biased coin design with bias probability p (EBC(p)) [24]: Patients are randomized to treatments by coin toss. The coin is biased with probability p in favor of the less frequently assigned treatment or fair whenever the same number of patients have been assigned to both treatments.
(IIab). Random allocation rule (RAR): Exactly n/2 patients are randomized to the control treatment and n/2 patients to the experimental treatment so that final balance is achieved with a maximum tolerated imbalance of n/2. All possible randomization sequences are equally likely.
(IIab). Permuted block randomization with block size k (PBR(k)): Patients are randomized in blocks of size k, with a random allocation rule being used within each block. The procedure achieves final balance and has a maximum tolerated imbalance of k/2.
(IIab). Berger's maximal procedure with final balance and a maximum tolerated imbalance of b (MP(b)) [25]: The set of randomization sequences generated by the random allocation rule is restricted to those for which the imbalance in group sizes never exceeds b. Among those sequences, all are equally likely.
(IIb). Big stick design with a maximum tolerated imbalance of b (BSD(b)) [26]: Patients are randomized to treatments by tossing a fair coin. If the imbalance in group sizes reaches a maximum tolerated imbalance b, the next patient is deterministically assigned to the treatment with fewer patients.
(IIb). Chen's biased coin design with a maximum tolerated imbalance of b and bias probability of p (CHEN(b, p)) [27]: Patients are randomized to treatments using Efron's biased coin design with bias probability p. If the imbalance in group sizes reaches b, the next patient is deterministically assigned to the treatment with fewer patients.
Let h C and h E denote the hazard functions associated with the control and experimental treatment. Assuming independent survival times and a constant hazard ratio over time, i.e., HR = h E (t)/h C (t) for all t > 0, the log-rank test [21,22] can be used to test the hypothesis that HR = 1. Conditioning on the number of patients at risk within the control and experimental group before each event, let O j denote the number of observed events and e j the number of expected events in the control group at the jth event time τ j for j 2 {1, . . ., d}. Under the null hypothesis, the log-rank statistic asymptotically follows a standard normal distribution [28], that is ðO j À e j Þ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi Under the alternative hypothesis, the log-rank statistic asymptotically follows a non-standard normal distribution. Possible approaches to obtain approximations for the expected value and variance have for example been proposed by Schoenfeld (1981) [29] or Freedman (1982) [30].

Bias model
The following bias model is a generalization of the model described in [9] which also incorporates time trends and can be applied not only to exponentially distributed survival times. For a randomization sequence z, let h i (t, z) denote the hazard function of patient i at time t. We assume that the hazard function is affected by a biased selection of patients and an unobserved time trend such that where η i (z) and θ i are functions that quantify the strength of selection and chronological bias. In contrast to the additive bias model for continuous outcomes [10], this newly defined bias model for survival outcomes has a multiplicative effect on the hazard functions.

Selection bias
We consider the phenomenon of allocation bias [31], a specific type of selection bias, using the model assumptions from [14,15]: We assume that all previously randomized allocations are unblinded and therefore known to the investigator, as is the intended allocation ratio of 1:1.
Since knowledge of previous allocations may enable an investigator to predict upcoming allocations, this bears the risk for a biased selection of patients. Further, we assume that the investigator guesses upcoming treatments independently of the underlying randomization procedure, employing the convergence strategy by Blackwell and Hodges [14]. Let Z 2 R be a factor for the strength of selection bias. Based on the observed number of allocations to the control N C iÀ 1 ðzÞ and experimental N E iÀ 1 ðzÞ treatment before the enrollment of patient i, we assume that the investigator predicts the next treatment whenever there is an imbalance in group sizes. This translates into the following biasing strategy where sgn is the sign function. Note that for η = 0, there is no selection bias effect on the hazard rate from Eq (2). Assume that long survival is desirable. Then, for any randomization procedure targeting balanced group sizes, an effect of η > 0 reflects the situation where the hazard rate of control patients is worsened in expectation while the hazard rate of experimental patients is improved in expectation. Conversely, an effect of η < 0 equates a biasing strategy where the hazard rate of control patients is improved in expectation while the hazard rates of experimental patients is worsened in expectation.
The right choice of a selection bias function depends primarily on whether the blinding of the investigator can be guaranteed. If it can be excluded with certainty that the investigator will become aware of past treatment assignments, for example if enrollment and randomization are carried out by an external institution, it is permissible to assume no selection bias at all. However, this very strict assumption will not be fulfilled in most studies. If blinding cannot be implemented or is at risk, a selection bias function should be defined. The selection bias effect η characterizes the heterogeneity of the medical condition within the study population and should be chosen accordingly. An assessment of the heterogeneity can be made on the basis of clinical experience or the anticipated treatment effect and by taking into account the precision of the inclusion and exclusion criteria.

Chronological bias
There are multiple causes for chronological bias in the form of time trends, such as improved performance of the treating physician or staff (learning curve), or an amendment of the enrollment criteria [7,32]. Following [4], possible options for the time trend function θ i include stepwise, linear, or logarithmic time trends Stepwise . . . ; n À 1g; Linear where y 2 R is a parameter that quantifies the strength and direction of the underlying time trend. Assuming long survival is desirable, an effect of θ > 0 indicates that the hazard rate worsens over the course of the trial, while, conversely, θ < 0 indicates that the hazard rate improves over the course of the trial. For θ = 0 no time trend effect is assumed. The choice of a time trend function should be based on the anticipated temporal changes that could affect the treatment success. For example, the assessment of whether a learning curve can occur depends on the experience of the treating physicians as well as on the novelty of the method under investigation. The risk assessment for an amendment of inclusion criteria or a change in classification of a disease due to medical progress should be related to the length of the accrual period. After a time trend function has been selected, a suitable time trend effect θ must be specified. If data from similar studies are available, the time trend effect can be estimated from these observations. If no such data is available, the time trend effect can be quantified on the basis of medical experience or the anticipated treatment effect.

Influence of bias on the log-rank statistic
If the survival distributions of the patients are influenced by selection or chronological bias as introduced in our bias model from Eq (2), this also affects the distribution of the log-rank statistic. We derive an approximation formula to compute the rejection probabilities in the presence of bias depending on the randomization sequence. The dependence of the type I error probability on the sample size, the bias effects and the randomization procedure is showcased in a sensitivity analysis. We conclude with a case study to illustrate how our evaluation method can be applied to select a suitable randomization procedure on the basis of the expected type I error probability. All computations were performed on a computer with an Intel i7-4710MQ CPU Quad-core (2.5 GHz) and 16 GB RAM under a Windows 7 (64-bit) operating system. The code was written in R version 3.5.1. [33], using the randomizeR package version 2.0 [34].

Distribution of the log-rank statistic in the presence of bias
Conditioning on the realized randomization sequence z, we demonstrate that if ln(HR) is in Oðn À 1=2 Þ and (θ i + η i (z)) can be expressed in terms of ln(HR), the log-rank statistic is asymptotically normal with the mean depending on z and variance 1, i.e., LRðzÞ � asymp: N ðE bias ðzÞ; 1Þ: The arguments used are the same as those used by Schoenfeld [29]. Let S i and f i denote the survival distribution and density function corresponding to the ith patient where i 2 {1, . . ., n}.
The validity of Eq (3) can be shown by conditioning on the realized randomization sequence z = (z 1 , . . ., z n ), as well as the information as to which patients are at risk before event time τ j , j = 1, . . ., d. Assume that ln(HR) is in Oðn À 1=2 Þ and that the sum of the two bias functions (θ i + η i (z)) can be expressed in terms of ln(HR). As in Eq (1), let e j (z) denote the expected number of events at event time τ j in the control group under the null hypothesis and let μ j (z) be the true expected number of events under the bias model (2). Then, if Y 1 , . . ., Y n are the observed event or censoring times By rewriting μ j (z) as a function of ln(HR) and expanding in a Taylor series around zero, it can be shown the log-rank statistic convergences in probability to The first term from (4) is asymptotically standard normal, thus yielding the asymptotic variance of 1. To obtain the asymptotic value of the second term consider the two functions with S cen denoting the survival function of the random censoring mechanism and S uni the survival function of the censoring mechanism due to end of follow-up, i.e.,

1;
for t < ðF À AÞ; 1 À ðt À ðF À AÞÞ=A; for ðF À AÞ � t < F; 0; for t � F: Consequently, if selection or chronological bias are present and the equality of two survival distributions is investigated using the log-rank test, formula (3) can be used to obtain an approximation of the rejection probability depending on the randomization sequence z.

Sensitivity analysis
The influence of the randomization procedure on the rejection probability of the log-rank test in the presence of selection or chronological bias is shown for different bias effects and sample sizes to illustrate the dependence on the study setting. We limit our considerations to scenarios in which the null hypothesis is true, i.e., the rejection probability corresponds to the type I error probability. The scenarios are simulated by drawing a Monte-Carlo sample of randomization sequences for each randomization procedure. The scenario dependent type I error probability of each randomization sequence is then calculated using approximation (3). The number of randomization sequences per Monte-Carlo sample is chosen to ensure a precision of at least 2 � 10 −4 for the expected type I error probability with 99.5% certainty.
Two randomization procedures, PBR (8) and RAR, are compared in four scenarios with different sample sizes and bias settings. The comparison includes a small sample size of n = 40 and a large sample size of n = 200. We consider two different bias settings, one with a larger time trend and one with a stronger selection bias effect: The common assumptions regarding the study design on which all four scenarios are based are shown in Table 1.
The distributions of type I error probabilities of PBR (8) and RAR for the different scenarios are shown in Fig 1. On the one hand, there are clear differences between the different randomization sequences within a randomization procedure, as can be seen from the variation in type I error probabilities. On the other hand, there are also considerable differences between the two randomization procedures and between the different scenarios under consideration. The differences between PBR(8) and RAR are caused by the dependence of the rejection probability on the randomization sequence. This is because the set of randomly drawn randomization sequences depends on the probability distribution of the randomization procedure used. The differences between the SB-and CB-setting confirm that selection and chronological bias act differently on the type I error probability of the log-rank test. With regard to the sample size, the results of RAR seem to be similar while the rejection probability for n = 200 compared to n = 40 increases greatly in the case of PBR (8). Consequently, the rejection probabilities can also be affected by the sample size.
In summary, we observe that the presence of selection and chronological bias affects the type I error probability of the log-rank test. The extent of this influence depends on the type of bias, the sample size and on the randomization sequence. Due to the dependence on the randomization sequence, there is a dependence on the randomization procedure used. For this reason, the choice of the randomization procedure has a direct effect on the expected type I error probability of the log-rank test and thus on the susceptibility to selection and chronological bias. The results of the four scenarios show that, depending on the interaction of bias and sample size, a different randomization procedure should be chosen with respect to the expected type I error probability. Consequently, no general recommendation for a randomization procedure can be made and a suitable randomization procedure must always be chosen individually with regard to the study in question.

Case study for acute myelogenous leukemia
We illustrate the application of our bias model in a case study which is based on the aml data set from the boot package [35] in R version 3.5.1. [33]. The choice of the case study is motivated by our desire to allow researchers to reproduce our results. The metric considered for choosing a suitable randomization procedure is the expected type I error probability [15]. Our simulation is designed to ensure a precision of at least 2 � 10 −4 for the expected type I error probability with 99.5% certainty. This requirement results in 7,500 randomization sequences per randomization procedure. The rejection probability for each sequence is computed using approximation (3).
Setting. We consider a fictitious study that aims at assessing the efficacy of maintenance chemotherapy consisting of cytarabine and 6-thioguanine for acute myelogenous leukemia. The effect estimates are based on the differences observed in a study by Embury et al. [35,36]. The study population consists of patients with acute myelogenous leukemia (AML) who have never been treated for AML and who are in a state of complete remission after induction therapy. The primary outcome is the time a patient remains in complete remission (in weeks) and a difference between maintenance chemotherapy and control treatment (no maintenance chemotherapy) shall be detected with a power of 80% at a significance level of α = 5%. Using the aml data set, the estimated hazard ratio is HR = 0.4003. Under the assumption of exponentially distributed survival times, the hazard rate for the control treatment is estimated as λ C = 0.0431, which results in a hazard rate of λ E = 0.0173 for maintenance chemotherapy. The study is designed for an accrual period of 18 weeks and a total duration of 52 weeks. The common exponential dropout rate throughout the study is assumed to be λ cen = 0.0077, thus, resulting in a total required sample size of n = 64 patients (determined using nQuery Advisor version 7.0). A summary of the study design is shown in Table 2.
Considerations relating to the randomization procedure. To achieve a state of complete remission, all patients receive induction therapy before continuing with either maintenance chemotherapy or no further chemotherapy. It is suspected that, in addition to the maintenance chemotherapy, the induction therapy also has an influence on the time a patient remains in complete remission [36]. Furthermore, it was observed that the duration of induction therapy until complete remission could be significantly shortened by changes in treatment regime [36]. This indicates that there is a risk of a learning curve or other medical progress in induction therapy. To account for this uncertainty, we define a small logarithmic time trend effect: y i ¼ 0:125 lnðHRÞ lnðiÞ= lnðnÞ: This time trend function reflects the situation where the maximum difference in treatment success, attained between the first and last patient enrolled, is of magnitude one eighth as strong as the treatment effect (θ = 0.125 ln(HR)).
With regard to the definition of a selection bias function, the following considerations should be taken into account: The concealment of past treatment assignments is not possible, as it is impossible to conceal whether or not a patient received chemotherapy and a placebo chemotherapy would be unethical. Since the only inclusion criterion for patients is to be in a state of complete remission after induction therapy for acute myelogenous leukemia, the study population will be very heterogeneous. For this reason, a strong selection bias effect, one fifth as strong as the anticipated treatment effect is assumed (η = 0.2 ln(HR)): Z i ðzÞ ¼ 0:2 lnðHRÞ sgnðN E iÀ 1 ðzÞ À N C iÀ 1 ðzÞÞ: Evaluation of the randomization procedures. The mean type I error probabilities and standard deviations corresponding to each randomization procedure are summarized in Table 3. In general, the mean type I error probability is increased compared to the nominal significance level of 5%. The randomization procedures which display the smallest increase are CR, BSD (7), and BSD(11) with mean type I error probabilities of 5.2%. RAR, MP (11), and MP (7) perform only slightly inferior with mean type I error probabilities of 5.4%, 5.4%, and 5.5%, respectively. The increase is most severe for PBR(4) with a mean type I error probability of 8.1%. Based on the expected type I error probabilities, one of the randomization procedures CR, BSD (7), or BSD(11) is most suitable for the given survival study. Since the maximum loss of power due to imbalance is best controlled by BSD (7), this gives BSD(7) an advantage over CR and BSD(11).

Discussion
Although it is widely acknowledged by regulatory authorities that well-conducted randomized controlled trials yield a higher level of evidence compared to observational studies [37][38][39], the importance attributed to randomization itself is usually limited to good implementation rather than a well conceived design [10]. This is reflected in the predominant use of permuted block randomization [40], which is usually chosen without a sound explanation [7,23]. For this reason, we strive to strengthen awareness of the fact that the choice of randomization procedure affects the extent to which a study is susceptible to certain types of bias. This was confirmed both in our sensitivity analysis and in the considered case study. In order to enable scientists to select a randomization procedure that is suitable for their individual study, appropriate bias models and easy-to-use evaluation methods are needed. Our results extend previous evaluation methods for normally and exponentially distributed outcomes to the simultaneous consideration of selection and chronological bias in survival studies with an accrual and follow-up period as well as different types of censoring. In doing so, the consideration of selection and chronological bias is not necessarily limited to the bias model considered here. For example, it is possible to define different time trend functions to reflect other more complex study settings. Likewise, one can consider alternative biasing strategies, e.g., under the assumption that the investigator guesses upcoming allocations taking into account the conditional allocation probabilities of the employed randomization procedure The summary statistics are based on a sample of 7,500 randomization sequences per randomization procedure.
https://doi.org/10.1371/journal.pone.0217946.t003 [8,12]. The presented evaluation method addresses the influence on the distribution of the log-rank statistic, since the impact of bias on the test decision is one of the most important measures [7]. While the metric we used was the expected type I error probability, other evaluation metrics such as the power can also be considered for quantifying the performance of a randomization procedure [10]. The most commonly used methods for analyzing survival data are the Cox proportional hazards model and the log-rank test [19,20], which in turn is equivalent to the score test of the discrete Cox proportional hazards model when the treatment group is the only covariate considered [41]. The approximation formula can thus also be used to gain an impression of the impact of bias if the final analysis shall be carried out by using a Cox regression model. Therefore, the proposed evaluation method is applicable to a majority of survival studies. We have shown that the assessment of whether and to what extent selection and chronological bias pose a risk must be carried out individually for each survival study at the trial planning stage. For this reason, we recommend to always compare different randomization procedures before conducting a clinical study. This leads to an improved study design and thus serves the greater goal of increasing the level of evidence. By implementing our results in the R-package random-izeR [34], we provide a free software tool that makes such a comparison possible for everyone.

Conclusion
The presented results enable researchers in the planning phase of a survival study to make a scientifically sound choice of a randomization design. Due to the frequent use of the log-rank test and Cox's proportional hazards model, our approach is applicable in most scenarios.