Addressing Loss of Efficiency Due to Misclassification Error in Enriched Clinical Trials for the Evaluation of Targeted Therapies Based on the Cox Proportional Hazards Model

A key feature of precision medicine is that it takes individual variability at the genetic or molecular level into account in determining the best treatment for patients diagnosed with diseases detected by recently developed novel biotechnologies. The enrichment design is an efficient design that enrolls only the patients testing positive for specific molecular targets and randomly assigns them for the targeted treatment or the concurrent control. However there is no diagnostic device with perfect accuracy and precision for detecting molecular targets. In particular, the positive predictive value (PPV) can be quite low for rare diseases with low prevalence. Under the enrichment design, some patients testing positive for specific molecular targets may not have the molecular targets. The efficacy of the targeted therapy may be underestimated in the patients that actually do have the molecular targets. To address the loss of efficiency due to misclassification error, we apply the discrete mixture modeling for time-to-event data proposed by Eng and Hanlon [8] to develop an inferential procedure, based on the Cox proportional hazard model, for treatment effects of the targeted treatment effect for the true-positive patients with the molecular targets. Our proposed procedure incorporates both inaccuracy of diagnostic devices and uncertainty of estimated accuracy measures. We employed the expectation-maximization algorithm in conjunction with the bootstrap technique for estimation of the hazard ratio and its estimated variance. We report the results of simulation studies which empirically investigated the performance of the proposed method. Our proposed method is illustrated by a numerical example.


Introduction
On February 20, 2015, in his State of the Union Address, US President Obama announced the launching of a new Precision Medicine Initiative (PMI). As pointed out by Collins and Varmus [1], the near-term goal of the PMI is focused on cancers each of which has its own genomic signature driven by molecular lesions. In response to the PMI, starting in 2014, the US National Cancer Institute (NCI) launched a series of novel, molecularly guided trials which include the Exceptional Responders Initiative, NCI MATCH, ALCHEMIST trial and Master Protocol for second-line treatment of squamous lung cancer [2]. Most molecularly guided trials employ the enrichment design [3]. The enrichment design consists of two phases. The first phase is the enrichment phase during which each patient is tested for specified molecular targets using a diagnostic device that is capable of detecting them. Only those patients testing positive for the specified molecular targets are enrolled into the second phase in which they are randomly assigned either to the targeted treatment or to the concurrent control treatment. Fig 1 provides a graphical representation of the randomization schema of an enrichment design [4].
The enrichment design may be an efficient design for evaluation of targeted therapies. However, very few diagnostic devices have a perfect accuracy with 100% positive predictive value (PPV). In particular, the PPV can be very low for some rare diseases with low prevalence. Consequently, a sizable proportion of the patients tested positive during the enrichment phase may not actually have the molecular targets of interest. On the other hand, the targeted therapy is only designed to be efficacious for the patients actually having the molecular targets. The targeted therapy may not only be ineffective but also cause harmful adverse reactions in those patients who do not have the molecular targets despite testing positive for them. Liu, et al. [5] showed that the treatment effect estimated without consideration of misclassification error is underestimated for the true-positive patients with the molecular targets. Liu, et al. [5] and Liu and Lin [6] proposed inferential procedures, based on the expectation-maximization (EM) algorithm in conjunction with the bootstrap technique, to estimate the actual treatment effect of the targeted therapy for continuous and binary endpoints, respectively, under the enrichment design. For the censored endpoints, Chen, et al. [7] under assumption of a oneparameter exponential distribution, suggested a parametric method for the inference of the hazard ratio for the targeted therapy under the enrichment design.
Most of clinical trials for the near-term goal in the PMI are for evaluation of targeted therapies in treatment of various cancers with progression-free survival (PFS) or overall survival (OS) as the primary or secondary efficacy endpoints. The Cox proportional hazard (PH) model is the most commonly employed statistical method for inference of treatment effects which takes individual patient characteristics into account. Consequently, we propose an application of the discrete mixture modeling for time-to-event data proposed by Eng and Hanlon [8] to develop an inferential procedure based on the Cox PH model [9] for estimating the treatment effect of the targeted therapy for the true-positive patients with molecular targets. In our approach, we assume that the number of classes is two: true-positive patients and false-positive patients. Following Eng and Hanlon [8], we only assume that the hazards are proportional within each class. Combining the expectation-maximization (EM) algorithm with the bootstrap technique, we obtain the estimates of hazard ratios with their estimated standard errors. Because the assumption of a common proportional hazard is relaxed, we are able to obtain the estimated hazard ratio not only for the patients truly with the molecular targets but also for the patients truly without the molecular targets. Hence, we can perform the inference for targetby-treatment interaction.
In the next section, our proposed method for taking into consideration the diagnostic inaccuracy and uncertainty, based on discrete mixture modeling and the EM algorithm for the Cox PH model, is introduced with inferential procedures. The results of numerical studies, including numerical examples and simulation studies, are provided in the third section. Numerical examples illustrate the application of our proposed method in practical scenarios. Simulation results provide empirical performance of our proposed methods in terms of size, power and coverage probability of confidence intervals. Final remarks and discussion are presented in the last section.

Enrichment Design and Likelihood Function
We developed the proposed procedure for the following situations. First, a specified molecular target has been identified in the pathway of pathogenesis of the disease. Secondly, a validated diagnostic device is available for detection of the specified molecular target with a known estimated positive predicted value from validation studies. Third, this device is only for detection of the molecular target and is not prognostic for clinical outcomes of patients. A targeted therapy is currently being developed for true-positive patients with the molecular target. Finally, the enrichment design is employed to evaluate the treatment effect of the targeted therapy for the patients tested positive for the molecular target by the diagnostic device. Our objective is to develop statistical inference for the treatment effect of the targeted therapy for the true-positive patients with the specified molecular target.
Although under the enrichment design, all randomize patients have tested positive for the molecular target, due to inaccuracy of the diagnostic device, some positive patients still may not have the specified molecular target. Therefore, there are two latent classes of patients: the patients with the molecular target (true-positive patients) and the patients without the molecular target (false-positive patients). (Pepe [10]) By naïvely assuming no misclassification error, one can apply the Cox proportional hazards model for inference of treatment effect of the targeted therapy without adjustment for diagnostic inaccuracy. This approach is referred to as the naïve method.
For simplicity, we only consider one covariate Z for treatment identification in the Cox PH model. Z is 1 for the targeted therapy and 0 for the concurrent control. For the patients with the molecular target (+), let (y +i , δ +i , z +i ), i = 1,. . .,n + be an independent right-censored sample with right-censored survival time, y +i , covariates z +i , and censored indicator δ +i (δ +i = 1 if the event occurs; = 0 if censored). (y −i , δ −i , z −i ) is similarly defined for the patients without the molecular target (-), i = 1,. . .,n − . Denote the collection of right-censored times, censored indicator and covariate for all patients as Y ¼ ðy þi ; . . . ; y þn þ ; y Ài ; . . . y Àn À Þ ; D ¼ ðd þi ; . . . ; d þn þ ; d Ài ; . . . d Àn À Þ ; Z ¼ ðz þi ; . . . ; z þn þ ; z Ài ; . . . z Àn À Þ We further assume a Cox PH model separately for the patients with and without the molecular target. Hence, under the PH assumption, the hazard function for true target status h is given as where h 0h (t) and λ h are the baseline hazard and the treatment effect (regression coefficient: log hazard ratio) with treatment indicator z hi , respectively, for patient i with true target status h, for i = 1,. . .,n h ;h = +,−. Table 1 gives the baseline hazards and log hazard ratios between treatment groups for the true-positive and false-positive patients.
The hypothesis of interest for the true-positive patients with the molecular target is given as The density function of the PH model for right censored survival times for true target status h is given as where H 0h (y hi ) is the cumulative baseline hazard function for patient i with molecular status h, i = 1,. . .,n h ;h = +,−.
To accommodate the inaccuracy of the diagnostic device and the variability of its estimate, we introduce a latent binary variable X i for the true target status of patient i which is independently and identically distributed (i.i.d.) as a Bernoulli distribution with probability γ, where γ is the PPV of the diagnostic device, i = 1,. . .,N; N = n + + n _ . In other words, X i = 1 if patient i has the molecular target; = 0 if patient i lacks the molecular target and P(X i = 1) = γ = 1−P(X i = . .,N. The density function of (y +i , y -i ) given latent variable X i is given as f þ;À ðy þ ; y À ; d þ ; d À jz þ ; z À ; x i ; g; l þ ; l À Þ ¼ ½f þ ðy þi ; d þi jz þi ; l þ Þ x i ½f À ðy Ài ; d Ài jz Ài ; l À Þ Hence, the joint density function of (y +i , Let ψ = (γ, h, λ + , λ − ) denote the vector of unknown parameters, where h = {h 0+ (y +i ), h 0 − (y −i )} indicates the set of baseline hazard functions. It follows that the complete-data log-likelihood function for ψ is given as

Application of Discrete Mixture Modeling and the EM Algorithm
To obtain the maximum likelihood estimates of the treatment effect by the expectation-maximization (EM) algorithm, in the E-step, we need to find the conditional expectation of X i given it follows that in the E-step, the conditional expectation is given as In the M-step, the updated positive predicted value (PPV) γ is given aŝ Following Eng and Hanlon [8], the estimates of the baseline and cumulative baseline hazard functions as a function of hazard ratios are given below respectively: The estimation procedure of the EM-based approach is summarized as follows and then explained more fully step-by-step: (7), and update γ by Eq (6).

Repeat
Step 2 until convergence.
For the initial values, the estimated PPV from the validation studies can be used as the initial value of γ (0) . l 0 À can be set as 1 and l 0 þ is the hazard ratio specified in the protocol of the enrichment design. The standard error of the log hazard ratio for the true-positive patients with the molecular target is estimated by bootstrap procedures [11].
Step 1. Choose a large bootstrap sample size B. We would suggest B ! 1000. For 1 b B, generate the bootstrap samples ðy b obs ; d b obs ; x b obs Þ, according to the probability model in Eq (4). The parameters employed to generate bootstrap sample ðy b obs ; d b obs ; x b obs Þ are replaced by the estimators obtained from the EM algorithm based on the original observations from the targeted clinical trial.
Step 2. Estimatesl Ã þb are obtained by applying the EM algorithm to the bootstrap sample Step 3. An estimator of the variance ofl þ is given as The null hypothesis is rejected and the efficacy of the targeted therapy is found to be different from that of the control group for the true-positive patient population at the α significance where z α/2 is the α/2 upper percentile of a standard normal distribution, and S 2 þB denotes the estimated variance ofl þ obtained by the bootstrap procedure. The corresponding 100(1−α)% asymptotic confidence interval for λ + can be constructed asl þ AE z a=2 ffiffiffiffiffiffi ffi S 2 þB q : Hence the 100(1−α)% asymptotic confidence interval of the hazard ratio for the true-positive patients with the molecular target is given as exp½l þ AE z a=2 ffiffiffiffiffiffi ffi S 2 þB q :

Simulation Setup
We conducted extensive simulation studies to empirically investigate and compare performance of our proposed method with the current procedure in terms of relative bias, coverage probability of confidence intervals, size and power. A modified R function by Eng and Hanlon [8] was employed in the simulation studies. Random samples of patient units with or without the molecular target were generated from the Bernoulli distribution with probability γ. Then the units were randomized in a 1:1 ratio to the targeted therapy group or concurrent placebocontrol group. Although we used a semi-parametric model which does not require the actual distribution, we still need the distributional form for the data used in the simulation studies. We generated the one-parameter exponential random variable data with the specified parameters λ + , λ − according to the status of the molecular target. It is presumed that the placebo control is not efficacious in the patients either with or without the molecular target. In addition, the targeted therapy is presumed ineffective in the false-positive patients without the target. The pair of survival times and censored indicators (y, δ) was generated according to the methods described in Chen, et al. [7].
The following specifications of parameters were considered in the simulation studies. The PPV was set to be 0.5, 0.6, 0.7, 0.8, and 0.9, which reflect a range of low to high positive predicted value. The censored proportions considered in the simulation studies were 0, 0.1, 0.2, 0.3 and 0.4. To investigate the finite sample properties, the sample sizes were set as 300, 600, and 900 per group. The size was evaluated at the hazard ratio of 1. The power of the proposed procedure was investigated at hazard ratios of 0.70, 0.75, 0.80 and 0.85. For each of 300 combinations, 1000 random samples were generated and the number of the bootstrap samples was also set to be 1000.
For estimation, we investigated the bias of the estimators and the coverage probability of the 95% confidence interval. For hypothesis testing, the performance measures included empirical size and power. The bias was estimated as the average of the differences between the estimates and the true value of the hazard ratio over 1000 simulated samples. The coverage probability was calculated as the proportion of the 1000 95% confidence intervals that contain the true value of the hazard ratio. The size and power were computed as the proportion of the 1000 samples for which the null hypothesis (H 0 : λ + = 0 vs. H a : λ + 6 ¼ 0) was rejected for the two-sided test at the 5% significance level. For a 95% confidence level, a simulation study with 1000 simulated random samples implies that 95% of the empirical coverage probabilities of all combinations would be within 0.9365 and 0.9635 if the proposed method provides sufficient coverage probability. In addition, for the 5% nominal significance level, a simulation study with 1000 random samples implies that 95% of the empirical sizes would be within 0.0365 and 0.0635 if the proposed method can adequately control the size at the nominal level of 0.05.

Numerical Examples
We constructed a dataset for a hypothetical scenario based on the information provided by the US Food and Drug Administration (FDA) in the package insert of Herceptin 1 [12]. A targeted therapy is being developed for the treatment of patients with a certain cancer whose specific molecular target is over-expressed as measured by an immunohistochemical (IHC) assay. Suppose that the IHC assay has a PPV of 0.75. From previous studies, the hazard ratios for the patients truly with and without the target are 0.7 and 1.26, respectively. Under the enrichment design, 480 patients with positive test results were randomized in 1:1 ratio to receive the targeted therapy plus the standard chemotherapy or to the standard chemotherapy alone. The censored rate is assumed to be 30%. Table 2 provides the point estimates of the hazard ratio between the two groups, their standard error, and 95% confidence intervals. In addition, Breslow's estimates of baseline hazards were computed for the true-positive and false-positive patients when the PPV was set at 0.75 [13]. The baseline hazards of the true-positive and falsepositive patients were 0.0027 and 0.008 respectively. Therefore, the baseline hazard of the truepositive patients was lower than that of the false-positive patients.
When the PPV was 0.75, the naive approach without consideration of inaccuracy of diagnostic device and the variability of its estimate yields the estimate of hazard ratio for mortality of 0.8318 with a 95% CI from 0.6637 to 1.0425. Because the 95% CI does contain 1, the observed hazard ratio of death is not statistically significantly different from 1 at the 5% significance level. The targeted therapy therefore fails to prove its superior efficacy over chemotherapy alone. On the other hand, our proposed EM method resulted in an estimated hazard ratio of 0.7026. The 95% CI for the hazard ratio is (0.5299, 0.9315), which does not contain 1. As a result, the efficacy of the targeted therapy plus chemotherapy is concluded superior to chemotherapy alone for the true-positive patients with the molecular target at the 5% significance level. Because 25% of the patients do not have the specified molecular target, failure to take into consideration inaccuracy of the diagnostic device leads to underestimation of the hazard ratio based on the naive method by a magnitude of 18.39%.

Results of the Simulation Studies
The simulation studies provide empirical results of performance of our proposed method compared with the current approach with respect to relative bias, coverage probability, size, and power. Due to the large volume of results generated in the simulation studies, we only present the results concerning relative bias, coverage probability and empirical power for the combinations with sample size of 300. They are provided in Table 3, Fig 2 and Table 4, Figs 3 and 4, respectively. All other results are provided in the Supporting Information.  The simulation results show that the empirical estimates of the positive predicted values are close to the specified values for all the different combinations, with a maximal absolute difference of only 0.008. The simulation results on relative bias and coverage probabilities are given in Table 3 and S1 and S2 Tables. Graphical presentations of a summarization of Table 3, S1 and S2 Tables are given in Fig 2, S1 and S2 Figs, respectively. The absolute relative bias of the estimator of the hazard ratio for the true-positive patients obtained by the naive approach ranges from 1.27% to 19.84%. In comparison, the absolute relative bias of the estimator of the hazard ratio for the true-positive patients with the molecular target obtained by the EM algorithm is smaller than 2%. As PPV increases, the relative bias decreases. Overall, the relative bias Addressing Loss of Efficiency Due to Misclassification Error in Enriched Clinical Trials is a decreasing function of the hazard ratio. There is no apparent relationship between relative bias and sample size and between relative bias and censoring proportion.
The results concerning the empirical coverage probabilities of the 95% confidence intervals for the hazard ratios by the current method and the EM method are described below. Only 20 of the 300 coverage probabilities (6.67%) of the 95% confidence intervals by the naive method exceed 0.9365. The coverage probability by the naive method are as low as 0.546. On the other hand, there are 245 of 300 coverage probabilities (81.7%) of the 95% confidence intervals by the EM method exceeding 0.9365. However, 280 of the 300 coverage probabilities (93.3%) of 95% confidence intervals constructed by the EM algorithm are above 0.93. No coverage probability of the EM method is below 0.91. The coverage probability is an increasing function of the PPV and a decreasing function of the hazard ratio and sample size. In summary, the proposed EM procedure not only gives nearly unbiased estimated treatment effect for the true-positive patients with the molecular target but also provides sufficient coverage probability.
The simulation results on the sizes are given in Table 5. From Table 5, the empirical sizes for the naive method and EM method for all combinations are within 0.0365 to 0.0635. The results demonstrate that both the naive and EM method can adequately control the size at the nominal level of 5%. The results concerning the empirical power for the naive method and EM method are given in Table 4, S3 and S4 Tables, respectively. Graphical presentations of a summarization of Table 4, S3 and S4 Tables are given in Fig 4, S3 and S4 Figs, respectively. In addition, Fig 3 presents the power curves when n = 300 per group, censored rate is 10%, and the PPV is 0.6. In is clear from Table 2, S3 and S4 Tables, that the empirical power is an increasing function of the PPV and a decreasing function of the censoring rate and the hazard ratio. For both methods, the power increases as the sample size increases. The simulation results of the empirical power demonstrate that the proposed EM procedure is uniformly more powerful than the naive method. In summary, under the PH model the proposed EM procedure not only can better control the size at its nominal level but also is more powerful than the naive method.

Discussion
Although all patients enrolled into the enrichment design are tested positive for the specified molecular target, some of them may not actually have the target due to inaccuracy of the diagnostic device. The proportion of false-positive patients may be sizable for rare diseases with low prevalence rates. Consequently, the treatment effect of the targeted therapy may be severely underestimated for the true-positive patients with the molecular target. In addition, the magnitude of underestimation is a decreasing function of the PPV. The Cox PH model is the most frequently employed method for evaluation of targeted therapies for cancer trials with progression-free survival (PFS) or overall survival (OS) as the primary efficacy endpoint. We applied the method of discrete mixture modeling proposed by Eng and Hanlon [8], based on the Cox PH model, to develop an inferential procedure of estimating the treatment effect for the truepositive patients with the molecular target. Our proposed procedure employs the EM algorithm in conjunction with the bootstrap method not only to accommodate inaccuracy of the diagnostic device but also to take into consideration the variability of the estimated accuracy measure. Empirical evidence from extensive simulation studies demonstrates that our proposed method is nearly unbiased and provides sufficient coverage probability for the unknown hazard ratio for the true-positive patients with the molecular target. In addition, our suggested method can adequately control type I error at the pre-specified nominal level and is also uniformly more powerful than the naive method. Because of the discrete mixture modeling, the PH assumption is relaxed such that the hazards are only assumed to be proportional within each group of patients either with or without the molecular target. Since the PPV is not 100%, the log hazard ratio for the patients without the molecular target can be similarly estimated by the EM algorithm. Its estimated standard error can be also obtained by the bootstrap method. Inference regarding the efficacy of the target therapy can similarly be made to the patients without the molecular target. In addition, the patients with the molecular target are independent of the patients without the target. We can also make inferences about the target-by-treatment interaction, i.e., the difference in the treatment effect between the groups of patients with and without the molecular target. Denotê l À and S 2 ÀB as the estimated log hazard ratio and its estimated variance for the patients without the molecular target. A (1-α)100% asymptotic confidence interval for λ + −λ − is given by A confidence interval for λ + −λ − not only can be used to make inferences about the existence of target-by-treatment interaction but also can test whether the efficacy of the targeted therapy of the patients without the target is either equivalent or non-inferior to that of the patients with the molecular target.
Using the same dataset used in the Numerical Example, the estimated log hazard ratio for the patients without the molecular target is 0.259 with an estimated standard error of 0.3049. The corresponding estimated hazard ratio is 1.2956 with a 95% asymptotic confidence interval of (0.7179, 2.3551). It follows that a 95% asymptotic confidence for λ + − λ − is (-1.2728, 0.0049) which includes 0. The target-by-treatment interaction is, therefore, not statistically significant at the 0.05 significance level. However, the target-by-treatment interaction is significant at the 0.10 level because the 90% asymptotic confidence interval is (-1.1667, -0.0574), which does not include 0. Therefore, this analysis of the target-by-treatment interaction further demonstrates that even under the enrichment design our proposed procedure can evaluate whether the new targeted therapy is efficacious for false-positive patients if the PPV is not too high. Our proposed method only considers one covariate in the PH model, which is the treatment indicator. However, we can also incorporate other covariates such as gender, age, or disease status in the model. The full likelihood with more than one covariate is given as where z j and β j are the other covariates and their corresponding regression coefficients. Similarly, the regression coefficients can be estimated by the EM algorithm and their estimated standard errors can be obtained via the bootstrap technique.
Since no diagnostic tool is perfect with 100% PPV, our proposed EM procedure is based on the Cox PH model given two latent class memberships. However, we would be interested in the impact of violation of the two latent classes' assumption. An additional simulation study was conducted to assess the performance for the special case of PPV = 1.00. The simulation results on the relative bias of the estimator for hazard ratio by both EM and naïve methods are provided in Table 6. As expected, the naïve method performs slightly better than the EM procedure when a prefect diagnostic tool is available. In such a situation if the PPV is 100%, we suggest the use of the naïve method. We also note that relative biases by both methods do not exceed 5%. The case of PPV = 1.00 has little impact on the bias of the EM procedure. On the other hand, the simulation results reveal that differences between the naïve method and the EM procedure decrease when the PPV increases.
We conducted another simulation study to investigate the robustness of our proposed EM procedure when the true-positive data is generated from the Weibull distribution with a shape parameter of 2 and the false-positive data is from an exponential distribution. Table 7 shows the relative bias in estimating the hazard ratio when n = 300, the censoring rate = 30%, and the hazard ratio is 0.75 with the PPV ranging from 0.5 to 0.9 in increments of 0.1. It appears that under these conditions the naïve method performs slightly better than the EM procedure based  on the Cox PH model. This indicates that violation of the distribution assumption can lead to a biased estimate of the hazard ratios. To remedy such a situation, future work will extend our proposed EM procedure to proportional hazards models using the parametric Weibull model, which has hazard function hðtjzÞ ¼ pt pÀ1 expðlzÞ; for parameter p>0.
Supporting Information  Table. Relative bias (%) and coverage probability for n = 600 per group. (PDF) S2 Table. Relative bias (%) and coverage probability for n = 900 per group. (PDF) S3 Table. Comparison of empirical powers for n = 600 per group. (PDF) S4 Table. Comparison of empirical powers for n = 900 per group. (PDF)