The Continual Reassessment Method for Multiple Toxicity Grades: A Bayesian Model Selection Approach

Toxicity grade information was considered by Yuan et al. (2007), who proposed the Quasi-CRM method to incorporate toxicity grades in phase I trials. A potential problem with the Quasi-CRM is that, as with the CRM, its performance may vary dramatically with the choice of skeleton. In this paper, we propose a new model, the Robust Quasi-CRM, which uses a Bayesian model selection approach to address this pitfall. The Robust Quasi-CRM builds directly on the BMA-CRM proposed by Yin and Yuan (2009) and considers a set of parallel skeletons for the Quasi-CRM. The performance of the Robust Quasi-CRM was demonstrated by extensive simulation studies. We conclude that the proposed method can be used in real practice.


Introduction
The primary goal of a phase I clinical trial of a new oncologic agent is to find a dose with acceptable toxicity, that is, to target the maximum tolerated dose (MTD). In practice, the MTD is often defined as the dose that produces a defined dose-limiting toxicity (DLT) in a pre-specified percentage of patients. Toxicity is often categorized into multiple grades. For instance, the general guidelines of the Common Toxicity Criteria (CTC) (National Cancer Institute, 2003) are grade 0 for no toxicity and grades 1, 2, 3, 4 and 5 for minimal toxicity, moderate toxicity, severe toxicity, life-threatening toxicity and death, respectively. This comprehensive grading scale is well established and widely adopted in clinical practice, which suggests that a binary response inappropriately ignores differences in toxicity severity. However, the most widely used dose allocation approaches, such as the traditional "3+3" design [1], the CRM design [2] and the recently proposed mTPI design [3], dichotomize these grades. For example, if grade 4 fatigue is considered a DLT, then grades 0-3 are non-DLT and are treated almost equally from the point of view of the trial design. Such dichotomization is known to work well for moderate toxicities. For severe and possibly irreversible effects, however, such as renal, liver, or neurological toxicities, a grade 4 toxicity is much more severe than a grade 3 toxicity. Therefore, those toxicity grades should not be treated indiscriminately.
In addition, given that phase I trials are typically small, it is important to use as much information as possible for decision making; using only partial toxicity information could be inefficient. More appropriate methods are needed to address this issue in the dose escalation procedure.
Several proposals in the literature address this issue. Bekele and Thall (2004) [4] (the BT method for short) applied severity weights to a soft tissue sarcoma trial with five types of DLTs. For each type of toxicity, physicians assigned each observed patient a severity weight on a common numeric scale, and the sum of these weights over the five toxicity types was called the total toxicity burden (TTB). The authors then considered a hypothetical collection of cohorts with a variety of possible outcomes. Yuan, Chappell and Bailey (2007) [5] also used severity weights to convert toxicity grades to numerical scores, and proposed the Quasi-CRM approach to incorporate these scores into the CRM. The recommended dose for the next patient is the dose level whose estimated score (the equivalent toxicity (ET) score) is closest to the target score, which is obtained from a prespecified toxicity profile at the MTD. The Quasi-CRM has been shown to recommend the optimal dose for further studies more often than the BT method. Van Meter, Garrett-Mayer and Bandyopadhyay (2012) [6] incorporated toxicity grades using a continuation ratio (CR) model in the likelihood-based CRM, and demonstrated that the proposed method outperformed its dichotomous CRM counterpart.
In 2009, Yin and Yuan [7] proposed using multiple parallel CRM models, each with a different set of prespecified toxicity probabilities. In the Bayesian paradigm, they assigned a discrete probability mass to each CRM model as the prior model probability. The posterior probabilities of toxicity can then be estimated by the Bayesian model averaging (BMA) approach. Dose escalation or de-escalation is determined by comparing the target toxicity rate with the BMA estimates of the dose toxicity probabilities. Yin and Yuan examined the properties of the BMA-CRM approach through extensive simulation studies, and compared this method and its variants with the original CRM. The results demonstrated that the BMA-CRM is competitive and robust, and that it eliminates the arbitrariness of prespecifying the toxicity probabilities. However, the BMA-CRM approach does not take multiple toxicity grades into account.
Although the Quasi-CRM method has good statistical performance, it uses only a single pre-specified skeleton for parameter estimation, as in the CRM paradigm, which could induce unstable estimators according to Yin and Yuan (2009). In this article we incorporate the essence of the BMA-CRM approach into the Quasi-CRM paradigm. We call our proposed design the Robust Quasi-CRM. Numerical comparisons of the Robust Quasi-CRM and the Quasi-CRM are described in Section 3, followed by a conclusion in Section 4.

Equivalent Toxicity Score
The "equivalent toxicity (ET) score" was proposed by Yuan et al. (2007) to measure the relative severity of different toxicity grades in the dose allocation procedure: grade 3 toxicities are assigned a value of 1, grade 2 toxicities a value of 0.5, and grade 4 toxicities a value of 1.5, while grades 0 and 1 receive a value of 0. An ET score equal to 1 is taken as the cutoff grade for DLT. With the ET score, the commonly used MTD definition is modified to incorporate grade information: the MTD is defined as the dose of the drug whose ET score equals the target ET score, which is computed from a prespecified toxicity grade profile at the MTD. Please also refer to Bekele and Thall (2004) for details.

Quasi-CRM
The quasi-likelihood function is constructed using a family of probability distributions that may not contain the true distribution. Estimators obtained by maximizing the quasi-likelihood function are called quasi maximum likelihood estimates (QMLEs). Under some regularity conditions, QMLEs are strongly consistent if the "quasi" distributions belong to linear exponential families such as the binomial family (Gourieroux, Monfort, and Trognon, 1984; McCullagh and Nelder, 1989) [8] [9].
Normalized ET scores, obtained by dividing each ET score by the maximum score $s_{\max}$, take values in $[0, 1]$. Obviously, their true distributions are not Bernoulli, since a Bernoulli variable takes only the values 0 or 1. However, if the dose-toxicity model is correctly specified, the QMLE will be strongly consistent because the Bernoulli distribution belongs to the binomial family.
Assume that the true relationship between dose and normalized ET score is given by $p^*_s(d)$, and write $p^*_j = p^*_s(d_j)$ for dose level $d_j$, $j = 1, \dots, J$. The goal is to find the MTD $d_0$, the highest dose level such that $p^*_s(d_0) \le p_0 / s_{\max}$, where $p_0$ is the target ET score and $s_{\max}$ is the maximum ET score. Assume that the last patient is tested at level $d(n) = d_j$ with normalized ET score $s^*(n)$. Then its contribution to the quasi-Bernoulli likelihood is $\psi\{d(n), s^*(n)\} = (p^*_j)^{s^*(n)} (1 - p^*_j)^{1 - s^*(n)}$. The quasi-Bernoulli likelihood is updated by $L_n = L_{n-1}\, \psi\{d(n), s^*(n)\}$, and $\{p^*_j\}_{1 \le j \le J}$ can be estimated accordingly. Note that if a functional dose-score curve is not assumed, the QMLE $\hat{p}_j = s_{\max} \hat{p}^*_j$, $j = 1, 2, \dots, J$, equals the observed average ET score at each dose level.
The quasi-Bernoulli likelihood provides a simple way to incorporate ordinal grades into parametric models. Yuan et al.(2007) successfully used it with the CRM in developing the Quasi-CRM.
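The remark that, without a functional dose-score curve, the QMLE equals the observed average ET score can be checked numerically. The sketch below is illustrative only: the ET scores for a single dose level are hypothetical, and a simple grid search stands in for a proper optimizer.

```python
import math

S_MAX = 1.5  # maximum ET score (the score assigned to grade 4)

def quasi_bernoulli_loglik(p, normalized_scores):
    """Quasi-Bernoulli log-likelihood: sum of s*log(p) + (1-s)*log(1-p),
    where each normalized score s lies in [0, 1] rather than {0, 1}."""
    return sum(s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
               for s in normalized_scores)

def qmle_grid(normalized_scores, grid_size=9999):
    """Maximize the quasi-log-likelihood over an evenly spaced grid in (0, 1)."""
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    return max(grid, key=lambda p: quasi_bernoulli_loglik(p, normalized_scores))

# Hypothetical ET scores observed at one dose level (not data from the paper).
et_scores = [0.0, 0.5, 1.0, 0.0, 1.5, 0.5]
normalized = [s / S_MAX for s in et_scores]

p_star_hat = qmle_grid(normalized)   # QMLE of the normalized ET score
p_hat = S_MAX * p_star_hat           # back on the ET-score scale
observed_mean = sum(et_scores) / len(et_scores)

print(f"QMLE on ET scale: {p_hat:.4f}, observed mean ET score: {observed_mean:.4f}")
```

Up to the grid resolution, the two printed values agree, which is the property stated in the text.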

Bayesian Model Selection Method
As pointed out by Yin and Yuan (2009), a major issue associated with the CRM is that pre-specification of the toxicity probabilities $(p_1, p_2, \dots, p_J)$ is arbitrary. If the $p_j$'s deviate far from the true dose-toxicity curve, this may lead to poor operating characteristics and a high probability of selecting the wrong dose as the MTD. To avoid subjectivity in specifying the skeleton, they proposed prespecifying multiple skeletons, each representing a set of prior estimates of the toxicity probabilities. During the trial, conditional on the observed data, these different models usually yield different estimates of the toxicity probabilities $(\hat{p}_1, \dots, \hat{p}_J)$. Some of these estimates may be close to the true values, whereas others may not, depending on how well the models fit the accumulated data. To accommodate the uncertainty in the specification of these skeletons, Yin and Yuan (2009) took a BMA approach to average $\hat{p}_j$ across the CRM models to obtain the BMA estimate of the toxicity probability for dose level $j$. BMA is known to provide a better predictive performance than any single model (Raftery, Madigan, and Hoeting 1997; Hoeting et al. 1999) [11] [12].
Specifically, let $(M_1, \dots, M_K)$ be the models corresponding to the sets of prior guesses of the toxicity probabilities, where model $M_k$ is based on the $k$th skeleton $(p_{k1}, \dots, p_{kJ})$. Let $\mathrm{pr}(M_k)$ be the prior probability that model $M_k$ is the true model, that is, the probability that the $k$th skeleton $(p_{k1}, \dots, p_{kJ})$ matches the true dose-toxicity curve. If there is no a priori preference for any single model, one can assign equal weights to the different skeletons by simply setting $\mathrm{pr}(M_k) = 1/K$. At a given stage of the trial, based on the observed data $D = \{(n_j, y_j), j = 1, \dots, J\}$, let $L(D \mid M_k)$ denote the marginal likelihood under model $M_k$, obtained by integrating the likelihood function over the prior distribution of the model parameter $\alpha_k$. The posterior model probability for $M_k$ is given by $\mathrm{pr}(M_k \mid D) = g_k L(D \mid M_k) / \sum_{i=1}^{K} g_i L(D \mid M_i)$, where $g_k = \mathrm{pr}(M_k)/\mathrm{pr}(M_0)$ is the prior odds for $M_k$ against $M_0$, $k = 1, \dots, K$.
The Bayesian model selection approach can be used to estimate the toxicity probabilities and to make the dose assignment decision. Specifically, at each decision point for dose assignment, we select the model with the highest posterior probability, i.e., model $k^* = \arg\max_{k \in \{1, \dots, K\}} \mathrm{pr}(M_k \mid D)$, and use that model for inference and dose assignment.
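The selection step can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from the paper: the power dose-toxicity model $p_{kj}^{\exp(\alpha)}$, the normal prior on $\alpha$ with variance 2, the three skeletons, the toy data, and the use of a trapezoidal rule for the marginal likelihoods.

```python
import math

def quasi_likelihood(alpha, skeleton, dose_idx, scores):
    """Quasi-Bernoulli likelihood under an ASSUMED power model p_j ** exp(alpha)."""
    lik = 1.0
    for j, s in zip(dose_idx, scores):
        p = skeleton[j] ** math.exp(alpha)
        lik *= (p ** s) * ((1.0 - p) ** (1.0 - s))
    return lik

def normal_pdf(x, sd):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def marginal_likelihood(skeleton, dose_idx, scores,
                        prior_sd=math.sqrt(2.0), lo=-6.0, hi=6.0, n=1201):
    """Integrate likelihood * prior over alpha with the trapezoidal rule."""
    h = (hi - lo) / (n - 1)
    vals = []
    for i in range(n):
        a = lo + i * h
        vals.append(quasi_likelihood(a, skeleton, dose_idx, scores)
                    * normal_pdf(a, prior_sd))
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def posterior_model_probs(skeletons, prior_probs, dose_idx, scores):
    """pr(M_k | D) is proportional to pr(M_k) * L(D | M_k)."""
    weighted = [pr * marginal_likelihood(sk, dose_idx, scores)
                for sk, pr in zip(skeletons, prior_probs)]
    total = sum(weighted)
    return [w / total for w in weighted]

# Hypothetical skeletons for six dose levels (illustrative values only).
skeletons = [
    [0.11, 0.25, 0.40, 0.55, 0.70, 0.85],
    [0.02, 0.05, 0.08, 0.12, 0.16, 0.20],
    [0.20, 0.35, 0.50, 0.65, 0.80, 0.95],
]
prior_probs = [1.0 / 3.0] * 3          # equal prior weights, pr(M_k) = 1/K

# Toy data: 0-based dose index and normalized ET score for each patient.
dose_idx = [0, 1, 2, 2, 3]
scores = [0.0, 0.0, 0.5 / 1.5, 1.0 / 1.5, 1.0]

post = posterior_model_probs(skeletons, prior_probs, dose_idx, scores)
best = max(range(len(post)), key=lambda k: post[k])   # index of the selected model
print("posterior model probabilities:", [round(p, 3) for p in post])
print("selected model index (0-based):", best)
```

With equal prior weights, the selected model is simply the one with the largest marginal likelihood; unequal prior odds would shift the choice through the `prior_probs` argument.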
Unlike the Quasi-CRM, our proposed robust version pre-specifies a set of $K$ parallel skeletons, $\{(p_{11}, \dots, p_{1J}), \dots, (p_{K1}, \dots, p_{KJ})\}$. After $n$ patients, the quasi-posterior estimate of the toxicity probability for dose $j$ under the $k$th skeleton is updated to $\hat{p}_{kj}$. Following the BMA-CRM approach, we also add a stopping rule to our algorithm: the trial is terminated for safety if the lowest dose is too toxic, that is, if pr(toxicity rate at $d_1$ > target $\mid D$) > 0.9. Under the selected model $M_{k^*}$, this condition reads $\mathrm{pr}\{p_{k^*}(\alpha_{k^*}) > \text{target} \mid M_{k^*}, D\} > 90\%$. We refer to the proposed approach as the Robust Quasi-likelihood approach in the following sections.

Simulation studies

Simulation settings
We investigated the operating characteristics of the proposed Robust Quasi-likelihood approach through simulation studies under eight different toxicity scenarios. Table 1, which is the same as in Yuan et al., gives the probability configurations for grades 0-4 in each scenario. We considered six dose levels and assumed that toxicity increased monotonically with dose. We prepared three sets of initial guesses of the toxicity probabilities. The first skeleton started at a relatively moderate toxicity probability of 0.11 and increased quickly to a high toxicity probability of 0.85. The second skeleton was for the case in which toxicity increased slowly at the low doses but increased quickly at the moderate doses; the highest dose had a toxicity probability of 0.20. The toxicity probabilities in the third skeleton were scattered evenly over a range of 0.2 to 0.95. These three skeletons thus represented three different prior opinions on the true dose-toxicity curve. We referred to the individual Quasi-CRMs using each of these three skeletons as Quasi-CRM 1, Quasi-CRM 2, and Quasi-CRM 3.
Table 2 lists, for each scenario, the true toxicity probabilities in the first row and the corresponding ET scores in the second row. Rows 3-8 give the percentages of the MTD being correctly identified and the average numbers of patients treated at each dose for the Quasi-CRM under each of the three skeletons, and rows 9-10 give the same results for the proposed Robust Quasi-CRM.
In the simulations, the target ET score was 0.47, which is equivalent to a DLT probability of 0.33. That is, if we consider the following toxicity profile: 49% grade 0 and grade 1, 18% grade 2, 23% grade 3, and 10% grade 4, then the target ET score is obtained by computing the weighted sum of ET scores over all grades, i.e., $R_0 = 0.49 \times 0 + 0.18 \times 0.5 + 0.23 \times 1.0 + 0.10 \times 1.5 = 0.47$. All simulations began at the lowest dose, and cohorts of one were treated at each stage. Dose escalation was restricted to the next higher pre-specified dose only. Each scenario was simulated 1,000 times with a maximum sample size of 20.
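The target ET score calculation can be reproduced in a few lines. This is a small sketch using the ET weights and the toxicity profile stated in the text; the function and variable names are illustrative.

```python
# ET score assigned to each grade category (grades 0 and 1 are pooled).
ET_WEIGHTS = {"0/1": 0.0, "2": 0.5, "3": 1.0, "4": 1.5}

def expected_et_score(grade_probs):
    """Weighted sum of ET scores over a toxicity grade profile."""
    if abs(sum(grade_probs.values()) - 1.0) > 1e-9:
        raise ValueError("grade probabilities must sum to 1")
    return sum(ET_WEIGHTS[g] * p for g, p in grade_probs.items())

# Toxicity profile at the MTD used in the simulations.
profile = {"0/1": 0.49, "2": 0.18, "3": 0.23, "4": 0.10}

target = expected_et_score(profile)        # target ET score
dlt_prob = profile["3"] + profile["4"]     # grades 3 and 4 count as DLT
normalized_target = target / max(ET_WEIGHTS.values())  # target on the [0, 1] scale

print(f"target ET score: {target:.2f}")
print(f"equivalent DLT probability: {dlt_prob:.2f}")
```

The normalized target (the target score divided by the maximum score 1.5) is the quantity compared with the normalized ET scores in the quasi-Bernoulli likelihood.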

Simulation results
In scenario A, the fourth dose was the desirable dose, and the three individual Quasi-CRMs using different skeletons selected the dose with the targeted ET score with very different probabilities. Quasi-CRM 1 performed the best, correctly identifying the MTD 49.3% of the time, and Quasi-CRM 2 performed the worst, correctly identifying the MTD only 36.4% of the time. The proposed Robust Quasi-CRM correctly identified the MTD 46.6% of the time. In this case, while the best Quasi-CRM design was slightly better than the Robust Quasi-CRM, the two were very comparable both in terms of correctly identifying the MTD and with respect to the number of subjects who were treated above the MTD.

Table 1. True probabilities of each grade (0/1, 2, 3, 4) at each dose level (1-6) for eight simulation scenarios (A-H).
Scenario B had the MTD at the fourth dose level, and the MTD selection percentage using the Robust Quasi-CRM was the second best among the four designs. The worst skeleton was that of Quasi-CRM 2, which correctly identified the MTD only 40.3% of the time, whereas the proposed design correctly identified the MTD 56.9% of the time. In scenario C, the sixth dose was the MTD. Quasi-CRM 3 performed the worst in this scenario, with an MTD selection percentage almost 50% lower than those of the other designs. In this case, the performance of Quasi-CRM 1 was also inferior to that of the proposed Robust Quasi-CRM method.
In scenario D, the first dose was the MTD. Quasi-CRM 1 correctly identified the MTD 70% of the time, while the proposed Robust Quasi-CRM correctly identified the MTD 73.1% of the time. With respect to the number of patients assigned to doses above the target ET score, our Robust design was the second best. Scenario E was similar to scenario A. In scenario F, the percentages of the MTD being correctly identified were quite close across designs, except that Quasi-CRM 2 assigned more patients to doses above the target ET score than the others. In scenarios G and H, the proposed Robust Quasi-CRM was again very robust, with an MTD selection percentage always close to that of the best-performing Quasi-CRM.
These findings demonstrate that the skeleton indeed plays a critical role in the Quasi-CRM design. There was a difference of more than 55% in the MTD selection probability when using different skeletons in scenario E. However, our Robust Quasi-CRM performed the second best, with the MTD being identified around 90.2% of the time.
Based on these simulations, we conclude that the proposed Robust Quasi-CRM method is quite robust in terms of dose selection probabilities and the average number of patients treated at the MTD level. It typically cannot perform as well as the best single Quasi-CRM in the set of skeletons, but its performance is always quite close to that of the best single Quasi-CRM and can be much better than that of the worst single Quasi-CRM. The proposed method carries the essence of the BMA-CRM proposed by Yin and Yuan (2009) by adaptively balancing among competing models, and thus offers more reliable and robust estimates of the toxicity probabilities.

Conclusion
In this paper we proposed a robust version of the Quasi-CRM to model toxicity grades, and demonstrated by simulation that it is more robust to the choice of skeleton than the single-skeleton Quasi-CRM. As pointed out by Yuan et al. (2007), the Quasi-CRM is most useful when DLTs are severe, possibly irreversible, or have a long duration.
The performance of the proposed design can be substantially better than that of the original Quasi-CRM if the skeleton in the CRM happens to be very far from the true model. The Robust Quasi-CRM method is straightforward to implement and easy to compute using Gaussian quadrature approximation or a Markov chain Monte Carlo procedure. Our method requires specifying multiple skeletons to cover different potential scenarios for the underlying dose-toxicity curve. It provides a reasonable compromise among the initial guesses of toxicity probabilities from different physicians. If one skeleton corresponds to the true toxicity probabilities, then the Robust Quasi-CRM would perform very well, because it often performs similarly to the best-performing Quasi-CRM. This Bayesian model-averaging procedure dramatically improves the robustness of the Quasi-CRM. As shown in the simulations, a certain skeleton often yields under-performing results; however, simultaneously specifying multiple skeletons reduces the likelihood of all sets of toxicity probabilities leading to a poorly performing Quasi-CRM design. The arbitrariness in the specification of the skeleton is reduced by incorporating the uncertainties associated with each skeleton into the Bayesian model-averaging procedure.
In our simulations we used a cohort size of one; however, cohort sizes of two or three could also be used. Our setup is based on the improved versions of the Quasi-CRM to optimize its practical performance. As an extension of the Quasi-CRM, the Robust Quasi-CRM makes this trial design more widely applicable and reliable for phase I clinical trials.