
## Abstract

An important objective in the development of targeted therapies is to identify the populations where the treatment under consideration has a positive benefit-risk balance. We consider pivotal clinical trials, where the efficacy of a treatment is tested in an overall population and/or in a pre-specified subpopulation. Based on a decision theoretic framework, we derive optimized trial designs by maximizing utility functions. Features to be optimized include the sample size and the population in which the trial is performed (the full population or the targeted subgroup only) as well as the underlying multiple test procedure. The approach accounts for prior knowledge of the efficacy of the drug in the considered populations using a two-dimensional prior distribution. The considered utility functions account for the costs of the clinical trial as well as the expected benefit when demonstrating efficacy in the different subpopulations. We model utility functions from a sponsor's as well as from a public health perspective, the latter reflecting societal interests. Examples of optimized trial designs obtained by numerical optimization are presented for both perspectives.

**Citation:** Ondra T, Jobjörnsson S, Beckman RA, Burman C-F, König F, Stallard N, et al. (2016) Optimizing Trial Designs for Targeted Therapies. PLoS ONE 11(9): e0163726. https://doi.org/10.1371/journal.pone.0163726

**Editor:** Robert K. Hills, Cardiff University, UNITED KINGDOM

**Received:** May 10, 2016; **Accepted:** August 17, 2016; **Published:** September 29, 2016

**Copyright:** © 2016 Ondra et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** This is a methodological paper which is not based on empirical data sets.

**Funding:** MP, NS and TO were funded by the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement number FP7 HEALTH 2013-602144, and CFB, FK, SJ under grant number FP7 HEALTH 2013-602552. The funder provided support in the form of salaries for authors SJ, FK, TO, MP, NS, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. CFB is an employee of AstraZeneca and was funded by grant number FP7 HEALTH 2013-602552 under a consultancy contract from AstraZeneca to Chalmers University for the purpose of conducting the work described herein. AstraZeneca did not play a role in the study data collection and analysis, decision to publish, preparation of the manuscript or financial support to CFB. The specific roles of these authors are articulated in the 'author contributions' section.

**Competing interests:** I have read the journal's policy and the authors of this manuscript have the following competing interests: RAB is a stockholder in Johnson & Johnson. Furthermore, he is Founder and Chief Scientific Officer of Oncomind, LLC. CFB is an employee and stakeholder of AstraZeneca. MP is head of the Center for Medical Statistics, Informatics, and Intelligent Systems that receives grants from industry. SJ, FK and TO have declared that no competing interests exist. This does not alter our adherence to PLOS ONE policies on sharing data and materials.

## 1 Introduction

In the development of targeted therapies the investigation of potentially predictive biomarkers is critical. If efficacy is limited to an identifiable subgroup of patients, developing a therapy for an unselected patient population is ethically problematic and will also require unnecessarily large sample sizes because of a diluted treatment effect. On the other hand, erroneously restricting a drug development program to a subpopulation is also unethical, as it excludes patients from an effective treatment. Furthermore, it will entail a financial loss for the sponsor because of unnecessary costs of biomarker development and screening and the lower prevalence of the future patient population.

Several one and two stage clinical trial designs have been proposed in which the treatment effect is tested in an overall population as well as in a subgroup of biomarker positive patients [1–6] (see [7] for a recent review). To account for the resulting multiple comparisons, tailored multiplicity adjustments have been developed [8–16]. Alpha allocation has also been optimized using interim trial data and/or data external to the trial, with respect to a utility function, providing an early example of the use of decision analysis [2].

In this paper we use a comprehensive decision theoretic approach to derive optimal trial designs for the development of targeted therapies. In particular, the framework allows us to assess when it is favourable to investigate the biomarker in a clinical trial and when it is actually more efficient to disregard the biomarker and to proceed with a classical trial design. This extends earlier decision theoretic methods that focused on the selection of the population for clinical trials incorporating a biomarker [17–22].

Consider a setting where a single potentially predictive binary biomarker has been identified in advance, separating the full population *F* into biomarker positive (*S*) and biomarker negative (*S*′) patients and there is prior evidence suggesting that the treatment effect may be more pronounced (or only present) in the biomarker positive group. Let *λ*_{S} and *λ*_{S′}, satisfying *λ*_{S} + *λ*_{S′} = 1, be the prevalences of biomarker positive and biomarker negative patients in the full population. For this situation we consider three design options for a pivotal clinical trial: (i) The *classical design* that does not account for the biomarker status and tests for a treatment effect in the full population only, (ii) the *stratified design* that also recruits patients from the full population but where the biomarker status of each patient is determined and the treatment effect is tested in the full population and the subpopulation, and, (iii) the *enrichment design*, where patients are screened for the biomarker status and only biomarker positive patients are included in the trial.

The choice of trial design will in general not only be based on power arguments, but on the overall expected utility of different designs, accounting for the potential rewards and costs. Rewards can be quantified by the sales revenue, from a sponsor’s view, or by a measure of the overall health benefit, from a public health view. The costs of the trial are determined by fixed and per patient costs as well as investments in biomarker development and the determination of the biomarker status for the patients in the trial. Based on a decision theoretic framework, we first optimize each of the three trial designs by choosing optimal sample sizes (and an optimized multiple testing procedure for the stratified design). Then, the optimal design can be selected among the three optimised designs based on their expected utilities. The optimal design choice depends on the type of utility function used (sponsor’s view or public health view), the reward and cost parameters, the prior distribution on the effect sizes and the prevalence of the biomarker positive subgroup.

## 2 Testing Problem and Considered Trial Designs

Let **Δ** = (*δ*_{S}, *δ*_{S′}) denote the treatment effects for the primary efficacy endpoint in the subgroup and its complement, respectively. Furthermore, let *π*(**Δ**) denote a prior distribution on **Δ**. We focus on priors that satisfy *π*(**Δ**) = 0 for *δ*_{S} < *δ*_{S′}. This accounts for settings where there is some evidence that the effect size in the biomarker positive treatment group may be larger than in the biomarker negative group but not the other way around.

For simplicity, we assume that the basis of marketing authorization is a single pivotal trial. We further assume that a necessary condition for regulators to approve a drug for the populations *S* or *F* is the demonstration of a significant treatment effect in the respective population by a suitable multiple testing procedure controlling the familywise error rate (FWER) at level *α* in the strong sense. Consider the two null hypotheses
*H*_{S}: *δ*_{S} ≤ 0 and *H*_{F}: *δ*_{F} ≤ 0,
where *δ*_{F} = *λ*_{S} *δ*_{S} + *λ*_{S′} *δ*_{S′}, and let, for some trial design *d*, *ψ*_{d} = (*ψ*_{S,d}, *ψ*_{F,d}) denote a multiple testing procedure such that *ψ*_{i,d} = 1 (0) if there is a statistically significant (no significant) treatment effect in population *i* = *S*, *F*.

We consider three types of trial designs: the classical, the stratified and the enrichment design. Let *D* denote the set of considered trial designs, which are defined below:

**Classical design** C_{n} refers to a classical parallel group design with per-group sample size *n*, recruiting patients from the full population and testing *H*_{F} only. *H*_{F} is tested by a non-stratified test.

**Stratified design** refers to a design which differs from the classical design in that the analysis is stratified by the biomarker status and both hypotheses *H*_{F} and *H*_{S} are tested with a weighted multiple testing procedure with parameter *α*_{S}. As multiple testing procedure we apply the closed Spiessens-Debois test [8, 13]. This test combines the Spiessens-Debois test for the rejection of the intersection hypothesis *H*_{S} ∩ *H*_{F} with the closed testing principle so as to obtain a test for the rejection of either *H*_{S} or *H*_{F} (or both). Let *p*_{S} and *p*_{F} denote unadjusted p-values for testing *H*_{S} and *H*_{F}, respectively. Here we assume that *H*_{F} is tested with a test stratified by the biomarker (in contrast to the classical design, where a non-stratified test is used, as no biomarker information is available). For *α*_{S}, *α*_{F} ≥ 0, the closed Spiessens-Debois test then rejects *H*_{S} if *p*_{S} ≤ *α* and either *p*_{S} ≤ *α*_{S} or *p*_{F} ≤ *α*_{F}. Similarly, it rejects *H*_{F} if *p*_{F} ≤ *α* and either *p*_{S} ≤ *α*_{S} or *p*_{F} ≤ *α*_{F}. To control the familywise error rate at level *α* in the strong sense, the significance levels *α*_{S} and *α*_{F} need to satisfy
P(*p*_{S} ≤ *α*_{S} or *p*_{F} ≤ *α*_{F} | *δ*_{S} = *δ*_{S′} = 0) = *α*. (1)
Thus, the significance level *α*_{F} is determined by Eq (1) if *α*_{S} ≤ *α* is given. Note that the corresponding function *α*_{F}(*α*_{S}) depends on the subgroup prevalence *λ*_{S}.
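Eq (1) can be solved for *α*_{F} numerically. The sketch below is an illustration, not the authors' code: it assumes two-sample z-tests with known variance, so that under the global null the test statistics for *H*_{S} and the stratified *H*_{F} are bivariate normal with correlation √*λ*_{S}; the function name `alpha_F` is our own.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def alpha_F(alpha_S, lam_S, alpha=0.025):
    """Solve the FWER condition of Eq (1): given alpha_S < alpha, find
    alpha_F such that P(p_S <= alpha_S or p_F <= alpha_F) = alpha under
    the global null.  The stratified statistic satisfies
    Z_F = sqrt(lam_S) * Z_S + sqrt(1 - lam_S) * Z_S', hence
    corr(Z_S, Z_F) = sqrt(lam_S)."""
    rho = float(np.sqrt(lam_S))
    cov = [[1.0, rho], [rho, 1.0]]

    def fwer(a_F):
        # P(Z_S > z_{1-alpha_S} or Z_F > z_{1-a_F}) under the global null
        upper = [norm.ppf(1 - alpha_S), norm.ppf(1 - a_F)]
        return 1.0 - multivariate_normal.cdf(upper, mean=[0.0, 0.0], cov=cov)

    # fwer is increasing in a_F; the root lies between ~0 and alpha
    return brentq(lambda a: fwer(a) - alpha, 1e-8, alpha, xtol=1e-10)
```

For example, with *λ*_{S} = 0.5 and *α*_{S} = 0.0125 the resulting *α*_{F} lies strictly between 0 and *α*; as *λ*_{S} (and hence the correlation of the test statistics) grows, less multiplicity correction is needed and *α*_{F} increases.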

We assume that in the stratified design, marketing authorization in the population *F* is not only determined by the treatment effect in *F*, but that regulators additionally require some evidence that there is a treatment effect in both *S* and *S*′, so that the rejection of *H*_{F} is not completely dominated by a treatment effect in a single subgroup only. Thus, we assume that the regulators' decision rule corresponds to a hypothesis test where *H*_{F} is only rejected if, in addition, the p-values *p*_{S} and *p*_{S′} of tests for efficacy in the two subgroups fall below corresponding thresholds *τ*_{S} and *τ*_{S′}. The resulting modified Spiessens-Debois test rejects *H*_{S} if {*p*_{S} ≤ *α*} ∧ {*p*_{S} ≤ *α*_{S} ∨ *p*_{F} ≤ *α*_{F}} and rejects *H*_{F} if {*p*_{F} ≤ *α*} ∧ {*p*_{S} ≤ *α*_{S} ∨ *p*_{F} ≤ *α*_{F}} ∧ {*p*_{S} ≤ *τ*_{S} ∧ *p*_{S′} ≤ *τ*_{S′}}. Note that this test is strictly conservative, because the consistency thresholds *τ*_{S} and *τ*_{S′} are not considered in the level *α* condition.

**Enrichment design** E_{n} refers to an enrichment design, which differs from the classical design in that only patients from the subpopulation *S* are recruited and only *H*_{S} is tested, using a non-stratified test.

## 3 Utility Functions

We define utility functions that quantify the potential rewards for each of the possible trial outcomes as well as the cost of the trial. To model the rewards, we distinguish between the sponsor and the public health view, leading to different utility functions for the two perspectives:
*U*_{d}^{(v)} = *R*_{S}^{(v)} + *R*_{S′}^{(v)} − *C*_{d}, (2)
where *v* = *Sponsor* for the sponsor and *v* = *Public* for the public health view, *d* ∈ *D* denotes the trial design, *R*_{i}^{(v)} the reward due to the trial outcome in subgroup *i* = *S*, *S*′ and *C*_{d} the cost of the trial. The cost functions *C*_{d}, *d* ∈ *D*, of the different trial designs are sums of fixed costs and costs per recruited patient in the trial. Note that the per-group sample size *n* may vary among the three designs; below we will determine optimal sample sizes for each type of design.

For the classical design the cost function is given by
*C*_{C_n} = *c*_{setup} + 2*n* *c*_{per-patient},
where the setup costs of the trial *c*_{setup} are fixed costs and *c*_{per-patient} are the marginal costs per patient. In the stratified design there are additional fixed costs *c*_{biomarker} to develop the biomarker and additional per-patient costs *c*_{screening} to determine the biomarker status. Thus, the cost function of the stratified design is given by
*C*_{S_n} = *c*_{setup} + *c*_{biomarker} + 2*n* (*c*_{per-patient} + *c*_{screening}).
For the enrichment design the fixed costs are the same as for the stratified design. However, to recruit only biomarker positive patients one has to screen (on average) 2*n*/*λ*_{S} patients from the full population until 2*n* biomarker positive patients are identified. Given that the screening and determination of the biomarker status induces costs *c*_{screening} per patient, the cost function is given by
*C*_{E_n} = *c*_{setup} + *c*_{biomarker} + 2*n* *c*_{per-patient} + (2*n*/*λ*_{S}) *c*_{screening}.
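A minimal sketch of the three cost functions as described above (per-group sample size `n`; the function and parameter names mirror the text and are our own):

```python
def cost_classical(n, c_setup, c_per_patient):
    # fixed setup costs plus per-patient costs for 2n recruited patients
    return c_setup + 2 * n * c_per_patient

def cost_stratified(n, c_setup, c_per_patient, c_biomarker, c_screening):
    # additionally: biomarker development plus screening of every recruited patient
    return c_setup + c_biomarker + 2 * n * (c_per_patient + c_screening)

def cost_enrichment(n, lam_S, c_setup, c_per_patient, c_biomarker, c_screening):
    # on average 2n / lam_S patients must be screened to find 2n positives
    return c_setup + c_biomarker + 2 * n * c_per_patient + (2 * n / lam_S) * c_screening
```

With the Case 3 parameters of Section 4 (*c*_{setup} = 1 MUSD, *c*_{per-patient} = 0.05 MUSD, *c*_{screening} = 0.005 MUSD, *c*_{biomarker} = 10 MUSD), a trial with *n* = 100 and *λ*_{S} = 0.5 costs 11, 22 and 23 MUSD under the three designs, respectively.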

### 3.1 The Sponsor’s Utility Function

For the sponsor, the utility is the Net Present Value (NPV), which is defined as the reward (sales revenue) minus the trial costs. We model the sponsor’s reward as a function of (i) the outcome of the regulatory approval process, (ii) the price the sponsor can achieve, and (iii) the size of the population the drug is licensed for.

To model (i) and (ii) we define reward functions for *i* = *S*, *S*′ that specify the reward obtained in the respective population. The reward function may depend on the observed data, the design of the pivotal trial *d* and the prevalence of the subgroup. We model the reward as the product of the price of the drug for the treatment of a single patient times the market size. Given an overall market size *N*, the market sizes of the two subgroups are *λ*_{S} *N* and *λ*_{S′} *N*, respectively. Furthermore, we assume that the payers are willing to pay more if a larger treatment effect was observed. If the drug is authorized for neither subgroup, both reward functions are set to zero. If the drug is authorized for the subgroup *S* only, the reward for the complement *S*′ is set to zero. If the drug is authorized for the full population, we assume that the same price is charged in both subgroups.

We assume that (given that the respective hypothesis test rejects and the observed effect size exceeds a clinically relevant threshold) the price increases linearly with the observed effect size. Then the reward functions for the subgroups *S* and *S*′ are
*R*_{S}^{Sponsor} = *λ*_{S} *N* [*ψ*_{F} *r*_{F} (δ̂_{F} − *μ*_{F})^{+} + (1 − *ψ*_{F}) *ψ*_{S} *r*_{S} (δ̂_{S} − *μ*_{S})^{+}], *R*_{S′}^{Sponsor} = *λ*_{S′} *N* *ψ*_{F} *r*_{F} (δ̂_{F} − *μ*_{F})^{+}, (3)
where *μ*_{i} denotes a minimal clinically relevant effect size for population *i* = *S*, *F* and (⋅)^{+} denotes the positive part. δ̂_{S} and δ̂_{F} are the estimates of *δ*_{S} and *δ*_{F} obtained from the trial data. The constants *r*_{i} for *i* = *S*, *F* are the marginal prices (the change in price if the observed effect size increases by one unit) and *N* denotes the total market size, which for the sponsor is defined as the number of future patients within the patent life of the therapy in the unselected, full population. Note that, given that efficacy is shown in the full population, a common treatment effect estimate is used in the price function. Then the overall reward within the patent life of the therapy is given by *R*_{S}^{Sponsor} + *R*_{S′}^{Sponsor}.
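The verbal pricing rules above can be sketched in code. This is an illustrative reading of the text, not necessarily the paper's exact formula; the function name and argument order are our own. `psi_S`, `psi_F` are the rejection indicators and `hat_dS`, `hat_dF` the observed effects:

```python
def sponsor_rewards(psi_S, psi_F, hat_dS, hat_dF, lam_S, N, r_S, r_F, mu_S, mu_F):
    """Returns (R_S, R_S') for the sponsor.  The price is linear in the
    observed effect above the relevance threshold; approval in F charges
    the same price, based on the common estimate hat_dF, in both
    subgroups; approval in S only is rewarded in S alone."""
    pos = lambda x: max(x, 0.0)
    if psi_F:                              # authorized for the full population
        price = r_F * pos(hat_dF - mu_F)
        return lam_S * N * price, (1.0 - lam_S) * N * price
    if psi_S:                              # authorized for the subgroup only
        return lam_S * N * r_S * pos(hat_dS - mu_S), 0.0
    return 0.0, 0.0                        # no authorization
```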

### 3.2 Public Health Utility Function

With the public health utility function we model the utility of trial designs under the assumption that the drug is developed by public health authorities. Therefore, the utility of a trial is given by the total health benefit to the society (adjusted by the production cost of the drug) minus the cost of running the trial. We assume that the benefit of the drug is measured on a monetary scale representing the expected, accumulated (over the whole treated population) treatment effect. Costs are assumed to be the same as under the sponsor view. The reward functions for the subgroups *S* and *S*′ are given by
*R*_{S}^{Public} = *λ*_{S} *N* [*ψ*_{F} *r*_{F} (*δ*_{F} − *μ*_{F}) + (1 − *ψ*_{F}) *ψ*_{S} *r*_{S} (*δ*_{S} − *μ*_{S})], *R*_{S′}^{Public} = *λ*_{S′} *N* *ψ*_{F} *r*_{F} (*δ*_{F} − *μ*_{F}). (4)

The first term in the utility function Eq (2) denotes the total benefits summed over the whole population, which are assumed to be proportional to the effect size (adjusted for a minimal relevant threshold), if the drug is authorized. The constants *r*_{i} for *i* = *S*, *F* are the marginal benefits (the change in benefit if the effect size increases by one unit), and *N* denotes the size of the future (unselected) patient population. Note that the benefit depends on the actual effect sizes *δ*_{i} and not on the corresponding trial estimates δ̂_{i}, implying that the benefit may be negative if the effect size is low. A consequence of this model choice is that a public health authority will take into account the risk of false positive approvals when optimizing its trial design. Such considerations are absent when a sponsor is optimizing, since we have assumed that only the estimated effects enter its utility function.
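The public health reward differs from the sponsor's in exactly two ways: it is driven by the true effects, and the benefit is not truncated at zero, so a false positive approval is penalized. A sketch with illustrative (our own) function and argument names:

```python
def public_rewards(psi_S, psi_F, delta_S, delta_F, lam_S, N, r_S, r_F, mu_S, mu_F):
    """Returns (R_S, R_S') for the public health view: driven by the true
    effects and not truncated at zero, so approving an ineffective drug
    yields a negative benefit."""
    if psi_F:                              # authorized for the full population
        benefit = r_F * (delta_F - mu_F)
        return lam_S * N * benefit, (1.0 - lam_S) * N * benefit
    if psi_S:                              # authorized for the subgroup only
        return lam_S * N * r_S * (delta_S - mu_S), 0.0
    return 0.0, 0.0                        # no authorization
```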

### 3.3 Optimizing the Expected Utility

Recall that *π* denotes a prior distribution on the effect sizes **Δ**. For a given utility function *U*^{(v)} and set of trial designs *D* the design optimizing the expected utility is given by
*d*^{opt} = argmax_{*d* ∈ *D*} E(*U*_{d}^{(v)}), (5)
where
E(*U*_{d}^{(v)}) = ∫ E(*U*_{d}^{(v)} | **Δ**) d*π*(**Δ**). (6)
Note that the expectation is first taken over the data distribution given the effect sizes **Δ** and then over the prior distribution *π*.
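A generic sketch of how Eqs (5) and (6) can be evaluated: the inner expectation over the data by Monte Carlo, the outer one as a weighted sum over a discrete prior. The callback `utility_given_data` is a placeholder for simulating one trial under design `d` and effect sizes (*δ*_{S}, *δ*_{S′}) and returning its realized utility; all names here are our own:

```python
import numpy as np

def expected_utility(utility_given_data, designs, prior_points, prior_weights,
                     n_sim=10_000, seed=0):
    """Pick the design maximizing the prior-expected utility, Eq (5).
    The inner expectation of Eq (6) (over the data, given the effect
    sizes) is approximated by Monte Carlo with n_sim simulated trials;
    the outer expectation is a weighted sum over the prior support."""
    rng = np.random.default_rng(seed)
    best, best_u = None, -np.inf
    for d in designs:
        u = 0.0
        for (d_S, d_Sp), w in zip(prior_points, prior_weights):
            sims = [utility_given_data(d, d_S, d_Sp, rng) for _ in range(n_sim)]
            u += w * float(np.mean(sims))
        if u > best_u:
            best, best_u = d, u
    return best, best_u
```

In the paper this inner expectation is computed by numerical integration under a normal approximation instead of simulation, which is faster and exact up to the approximation error.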

## 4 Numerical Examples

We consider parallel group designs for the comparison of means of a continuous outcome. We assume that the responses in the control and experimental treatment arms *k* = *C*, *T* in subgroups *j* = *S*, *S*′ are normally distributed with mean *θ*_{k,j} and variance *σ*^{2}. However, utilizing the central limit theorem, the model can be modified to account for many other situations. The mean treatment effects in the two subgroups are given by *δ*_{j} = *θ*_{T,j} − *θ*_{C,j}. In the classical and the enrichment design non-stratified z-tests are performed to test *H*_{F} and *H*_{S}, respectively. In the stratified test, an elementary p-value *p*_{S} is computed from a z-test for *H*_{S} based on the observations in *S* and a p-value *p*_{F} from a stratified z-test for *H*_{F} stratifying by biomarker status. Then the closed Spiessens and Debois test is performed to adjust for multiplicity. We set *σ* = 1.

To be able to compute the expected utilities by numerical integration rather than simulation, we approximated the sampling distributions for both the classical and the stratified designs by normal distributions (for the enrichment design the z-test statistic is exactly normally distributed). For the classical design, each subject recruited to the trial belongs to *S* with probability *λ*_{S}. Therefore, each observation in group *i* = *T*, *C* is with probability *λ*_{S} distributed as N(*θ*_{i,S}, *σ*^{2}) and with probability *λ*_{S′} distributed as N(*θ*_{i,S′}, *σ*^{2}). If the biomarker is either prognostic (such that *θ*_{C,S} ≠ *θ*_{C,S′}) or predictive (such that *δ*_{S} ≠ *δ*_{S′}), the overall treatment effect estimate δ̂_{F} for the classical design is not exactly normal but, for sufficiently large sample sizes, approximately normal by the central limit theorem. Because the observations are drawn from a mixture distribution, the standard deviation of δ̂_{F} increases with the absolute differences |*θ*_{i,S} − *θ*_{i,S′}|, *i* = *T*, *C*. For simplicity, in the numerical examples we assume that the biomarker is predictive only but not prognostic (i.e., *θ*_{C,S} = *θ*_{C,S′}; see Appendix A for further details). For the stratified design, we assume that the subgroup estimates δ̂_{S} and δ̂_{S′} are constructed as the sample means of exactly *λ*_{S} *n* (resp. *λ*_{S′} *n*) observations per group from the subgroups *S* and *S*′. If patients are not selected for the trial based on biomarker status, the number of subjects from each subgroup is in fact binomially distributed; for large *n*, however, the random sample sizes have only a small impact and the approximation becomes accurate. Therefore, in the numerical investigations we introduced a minimal sample size of *n*_{min} = 50 patients per treatment arm.
For the contour plot (in Subsection 4.1.3) optimization was performed by evaluating the objective function for a grid of candidate sample sizes (and *α*_{F} values for the stratified design). For the optimizations in the other plots, a further optimization step was applied by optimizing the objective functions with the R Version 3.2.4 procedure *optim* [23] using grid points as starting values.
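The variance inflation of the classical design's overall estimate can be made concrete: an observation drawn from the two-subgroup mixture has variance *σ*^{2} + *λ*_{S} *λ*_{S′}(*θ*_{S} − *θ*_{S′})^{2}, which reduces to *σ*^{2} when the subgroup means coincide. A small sketch (helper name our own):

```python
import numpy as np

def mixture_variance(lam_S, theta_S, theta_Sp, sigma=1.0):
    """Variance of one observation from the classical design's mixture:
    within-subgroup variance sigma^2 plus a between-subgroup term that
    grows with the squared difference of the subgroup means."""
    return sigma**2 + lam_S * (1 - lam_S) * (theta_S - theta_Sp)**2
```

For instance, with *λ*_{S} = 0.4 and subgroup means 0.8 and 0.2 the single-observation variance is 1 + 0.24 · 0.36 ≈ 1.086 rather than 1, and a simulation reproduces this value.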

The one-sided significance level is set to *α* = 0.025 and the consistency thresholds in the multiple test for the stratified design to *τ*_{S} = *τ*_{S′} = 0.3. We consider discrete prior distributions on a grid (*δ*_{S,i}, *δ*_{S′,i}), *i* = 1, …, *l*, of effect sizes and specify two priors corresponding to scenarios where there is either only weak or strong prior evidence that the biomarker is predictive. The prior distributions used in the examples are defined in Table 1 and depend on an effect size parameter *δ*. In the examples below we set *δ* = 0.3, with the exception of Subsection 4.1.3 where optimal designs for other choices of *δ* are explored.

The constant *δ* > 0 parametrizes the effect sizes in the prior.
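Such priors are easy to encode. The sketch below builds a two-point prior in the spirit of Table 1, but the weights are purely illustrative, since the actual Table 1 entries are not reproduced here; both support points respect the constraint *δ*_{S} ≥ *δ*_{S′}:

```python
def make_prior(delta, w_predictive):
    """Two-point prior on (delta_S, delta_S'): with probability
    w_predictive the biomarker is predictive (effect delta in S only),
    otherwise the effect delta is homogeneous across S and S'.
    The weights are illustrative only, not the Table 1 values."""
    return {(delta, 0.0): w_predictive, (delta, delta): 1.0 - w_predictive}

weak_prior = make_prior(0.3, 0.5)    # weaker evidence that the biomarker is predictive
strong_prior = make_prior(0.3, 0.8)  # stronger evidence
```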

The reward and cost parameters in the sponsor and the public health utility function are specified via the following three cases:

- **Case 1:** Corresponds to a large market, where the biomarker costs are negligible, i.e. *Nr*_{S} = *Nr*_{F} = 10,000 million US dollars (MUSD) per unit of efficacy and *c*_{screening} = *c*_{biomarker} = 0.
- **Case 2:** Corresponds to a small market, where the biomarker costs are still negligible, i.e. *Nr*_{F} = *Nr*_{S} = 1000 MUSD per unit of efficacy.
- **Case 3:** We add biomarker and screening costs, *c*_{screening} = 5000 USD per patient and *c*_{biomarker} = 10 MUSD. The reward parameters *Nr*_{S} and *Nr*_{F} are the same as in Case 2.

For all three cases we choose *c*_{per-patient} = 0.05 MUSD and *c*_{setup} = 1 MUSD. Note that the setup costs are assumed to be the same for the enrichment, classical and stratified design and therefore have no impact on the order of their expected utilities. However, they do have an impact on the sign of the utility, and thus on whether any trial design is superior to no trial at all. In the reward functions Eqs (3) and (4) we set the minimal clinically relevant thresholds to *μ*_{S} = *μ*_{F} = 0.1, which is a third of the effect size *δ* = 0.3 used in the prior distributions in Sections 4.1.1 and 4.1.2.

### 4.1 Results

We discuss the optimal designs for the weak and the strong biomarker prior and the three cases specifying the cost and reward parameters.

#### 4.1.1 Optimization under the Weak Biomarker Prior.

**Large market, no biomarker costs (Case 1)** The optimized utilities and corresponding optimal classical, stratified and enriched designs are shown in Fig 1.

**Fig 1.** Optimized expected utilities and sample sizes for the enrichment, classical and stratified design as functions of the prevalence for *λ*_{S} ∈ [0.05, 0.95]. For the stratified design, optimized levels *α*_{S} and *α*_{F} for the multiple testing procedure are given. The last row shows the overall probability (averaged over the prior) that a significant treatment effect in *H*_{S} or *H*_{F} can be shown (and, for the stratified design, that the thresholds *τ*_{S} and *τ*_{S′} are crossed). The priors are defined as in Table 1 with *δ* = 0.3.

*Optimal utility.* For the sponsor utility function, the stratified design has the largest expected utility, except for low prevalences where the classical design is optimal. The latter is on first sight surprising, because in Case 1 we assume no biomarker costs. However, in the stratified design (in contrast to the classical design), to show efficacy in the full population, we require that *p*_{S} and *p*_{S′} do not exceed *τ*_{S} = *τ*_{S′} = 0.3 (in addition to rejection of *H*_{F} in the multiple testing procedure). Thus, for low prevalences the sample size of the stratified design needs to be substantially increased to reach a sufficient power to show efficacy in *F* and therefore its expected utility is lower.

For the public health utility function we observe a similar pattern. However, for large *λ*_{S} the expected utility of the classical design is almost identical to that of the stratified design. This holds because the power to reject *H*_{S} in the optimized stratified design approaches the power to reject *H*_{F} in the classical design and the rewards obtained for authorization in populations *S* and *F* are similar. Why is the stratified design for the sponsor view still optimal in this case? This results from the fact that the size of the reward in the sponsor view depends on the observed rather than the true treatment effect: for trial outcomes where *H*_{F} can be rejected in the classical design but, due to the variability of estimates, δ̂_{S} is large while δ̂_{S′} is small (and thus δ̂_{F} is only moderate), the reward for a market authorization in *S* may become larger than the reward in *F*. However, while the classical design leads to rejection of *H*_{F} in such cases, the stratified design rejects *H*_{S} and not *H*_{F} because of the consistency threshold.

*Optimal sample size.* Overall, the optimized sample sizes for the public health utility function are larger than for the sponsor utility function. They are lowest for the enrichment design, and—for smaller prevalences—largest for the stratified design. For the latter, the sample size increases sharply for low prevalences. This is due to the fact that a sufficient sample size in the subgroup is required to achieve adequate power for the rejection of both *H*_{S} and *H*_{F} (for the latter due to the consistency threshold *τ*_{S}). Furthermore, the relationship of the optimal sample size and the prevalence is qualitatively different for the three designs. For both utility functions the optimal sample size is increasing in the prevalence for the enrichment design (because the gain when demonstrating efficacy in *S* increases), decreasing for the stratified design (because, as noted above, a sufficient sample size in *S* is required for the rejection of *H*_{S} and for the rejection of *H*_{F}) and non-monotone for the classical design (essentially because the effect size in population *F* is increasing in *λ*_{S} such that for small *λ*_{S} the expected utility does not sufficiently increase with the sample size to compensate the additional costs, while for large *λ*_{S} a smaller sample size is sufficient to achieve adequate power).

*Significance levels.* In the intersection hypothesis test of the optimal multiple testing procedure in the stratified design, *α*_{S} is larger than *α*_{F} for almost all prevalences: the larger *α*_{S} compensates for the smaller sample size in the subgroup. For increasing prevalences, the correlation of the test statistics used to test *H*_{S} and *H*_{F} increases, such that less multiplicity correction is required and both *α*_{S} and *α*_{F} increase.

*Power.* We define the power corresponding to a specific trial design as the overall probability (averaged over the prior) of regulatory approval in any population. This is a slight generalization of the traditional concept of power, which in the current context may be defined as the probability of regulatory approval conditional on a specific pair of subgroup effects. The power obtained by averaging over a prior has also been referred to as *assurance* [24]. The curves shown in Fig 1 correspond to the optimal designs. It can be seen that the power is largest for the enrichment design, followed by the stratified and the classical design and that it increases with the prevalence.
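For the enrichment design the assurance has a closed form under the normal model of Section 4: the z-test for *H*_{S} rejects with probability Φ(*δ*_{S} √(*n*/2)/*σ* − *z*_{1−α}), and averaging this over the prior gives the assurance. A sketch (the prior is encoded as a dict mapping (*δ*_{S}, *δ*_{S′}) pairs to weights; names our own):

```python
import numpy as np
from scipy.stats import norm

def assurance_enrichment(n, prior, alpha=0.025, sigma=1.0):
    """Prior-averaged probability that the enrichment design's z-test
    rejects H_S: Phi(delta_S * sqrt(n/2) / sigma - z_{1-alpha}),
    averaged over a discrete prior {(delta_S, delta_S'): weight}."""
    z = norm.ppf(1 - alpha)
    return sum(w * norm.cdf(d_S * np.sqrt(n / 2.0) / sigma - z)
               for (d_S, _), w in prior.items())
```

With a point prior at *δ*_{S} = 0 the assurance equals *α*, and it increases with both *n* and the prior mass on positive effects.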

Note that for the stratified design, the probability to obtain marketing authorization in *F* is largest for intermediate values of *λ*_{S} and much lower than for the classical design if *λ*_{S} is large (even though the optimized sample sizes are similar in this case). This is due to the application of the consistency thresholds, which are more difficult to meet if one of the subgroups *S* or *S*′ is small.

**Small market, no biomarker costs (Case 2)** Case 2 differs from Case 1 only in that the rewards *Nr*_{F} and *Nr*_{S} are reduced by a factor of 10. Because of the lower rewards the optimized expected utilities are smaller compared to Case 1 (see Fig 2 for the expected utilities and optimized design parameters). They decrease by more than a factor of 10, as the trial costs are not reduced proportionally.

However, the optimized sample sizes (and consequently the overall probabilities to show efficacy in the respective populations) are substantially smaller than in Case 1. Overall, the expected utilities follow a similar pattern as in Case 1 but the range of prevalences where the classical design has a higher expected utility is larger than in Case 1 for both the sponsor and the public health utility functions. The assumption of a smaller market qualitatively changes the optimized sample size of the stratified designs as a function of the prevalence. For low prevalences the optimized sample size is much lower than in Case 1: because the reward is lower, it does not pay off to invest in a large overall sample size to meet the threshold *τ*_{S} in the subpopulation. This is also reflected in the optimized significance levels *α*_{S} and *α*_{F}, which give more weight to *H*_{F} than in Case 1.

**Small market with biomarker costs (Case 3)** Note that the addition of biomarker costs has no impact on the expected utility of the optimal classical design (as it does not require the biomarker). However, the expected utilities of the enrichment and the stratified design become smaller compared to Case 2 because of the additional costs. Therefore, the classical design now dominates the stratified design for a broader range of (small) values of *λ*_{S} and the stratified design becomes optimal only for larger values of *λ*_{S} (see Fig A in S1 File). In the public health view, the classical design dominates the stratified design also for very large values of *λ*_{S}: even though the classical design leads to lower expected rewards compared to the stratified design (since the latter is more likely to lead to market authorization for too large a population), this is compensated by the lower costs because no biomarker is required. In the sponsor view in contrast, the difference between the expected rewards of the stratified and the classical design is larger because it is determined by observed treatment effects (see also the discussion of expected utilities in Case 1, where a similar pattern is observed). Therefore, the stratified design dominates also for large values of *λ*_{S}. Moreover, the biomarker costs lead to a reduction in sample size compared to Case 2.

#### 4.1.2 Optimization under the Strong Biomarker Prior.

First, note that the expected utility and optimal sample size of the enrichment design is the same for the weak and the strong biomarker prior because the prior distribution on the treatment effect in *S* is identical in both.

**Large market, no biomarker costs (Case 1)** For the sponsor’s utility function the stratified design is still optimal, with the exception of very low prevalences (see Fig B in S1 File). In contrast, for the public health utility function, the enrichment and the stratified design have almost identical expected utilities unless the prevalence is small.

**Small market, no biomarker costs (Case 2)** While, as in Case 1, the stratified design is optimal for the sponsor view for all but very low prevalences, the difference between the expected utilities of the stratified and enrichment design is small.

In contrast, for the public health utility function the enrichment design achieves the highest expected utility (see Fig 3). Furthermore, for very low prevalences, none of the trial designs has a positive expected utility in the public health view and the optimal strategy is to perform no trial at all. For the sponsor view it is still optimal to perform a trial in the unselected population, albeit with the minimal sample size if the prevalence is small. This is due to the assumption that the NPV depends on the observed effect sizes, which implies that the sponsor benefits from a high variability of the treatment effect estimates.

Note that the optimal test in the stratified design puts most weight on *H*_{F} for low prevalences and on *H*_{S} for large prevalences. This holds for both the public health and the sponsor utility function.

**Small market, biomarker costs (Case 3)** The pattern is very similar to Case 2; however, the range of *λ*_{S} values for which the classical design (for the sponsor utility) or performing no trial (for the public health utility) is optimal becomes larger (see Fig C in S1 File).

#### 4.1.3 Optimized Designs for Varying Effect Sizes.

Fig 4 shows the optimal design as a function of the prevalence *λ*_{S} and the effect size parameter *δ*, which parametrizes the effect sizes in the weak and strong biomarker priors in Table 1. Under the sponsor view, either the classical or the stratified design is optimal, while the enrichment design never maximizes the expected utility. Surprisingly, even for *δ* = 0, it is never optimal from a sponsor point of view to conduct no study at all in the scenarios investigated. This is due to the fact that a false positive, even though unlikely, may lead to a large reward. Therefore, the optimal sample size in these scenarios is the minimal sample size *n*_{min}, which minimizes the costs and maximizes the variability of the estimates.
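The observation that the sponsor can benefit from noisy estimates can be illustrated with a small numerical sketch (an illustration, not the paper's model): if the true effect is zero and the reward is proportional to the positive part of the observed effect, the expected reward grows with the standard error and therefore shrinks as the sample size increases.

```python
# Hedged sketch: true effect 0, estimate X ~ N(0, se^2), reward proportional
# to max(X, 0). Then E[max(X, 0)] = se / sqrt(2*pi), which increases with the
# standard error se and hence decreases with the sample size n (se ~ 1/sqrt(n)).
import math

def expected_positive_part(se):
    """E[max(X, 0)] for X ~ N(0, se^2): closed form se / sqrt(2*pi)."""
    return se / math.sqrt(2 * math.pi)

# Smaller trials give noisier estimates and a larger expected positive part.
for n in (10, 100, 1000):
    se = 1 / math.sqrt(n)  # standard error of the effect estimate (illustrative scale)
    print(n, round(expected_positive_part(se), 4))
```

This is the Jensen-type effect behind the minimal-sample-size result: a payoff that is convex in the estimate rewards variability when the true effect is null.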

Optimized designs for the sponsor and the public health authority are shown for both the weak and the strong biomarker prior (as defined in Table 1) under the three different cost structures defined by Cases 1, 2 and 3. The colour at a specific point indicates the type of the optimal design. Grey areas correspond to regions where all optimized designs have negative utilities, implying that the optimal choice is to perform no trial.

For the public health view, in contrast, the optimal decision for very low effect sizes is to perform no trial at all. Under the weak biomarker prior, the enrichment design is optimal under the public health view only in the scenarios without biomarker costs, for small *δ* and large enough prevalences (such that the population that will benefit from the new treatment in the future is large enough). For larger effect sizes, the classical design is optimal for very low and very large prevalences and the stratified design otherwise.

Under the strong biomarker prior and intermediate *δ*, the public health utility is optimized by the enrichment design (unless the prevalence is too low and the classical design dominates). For larger *δ* the stratified design is optimal, again with the exception of very low prevalences. In addition, in the scenario with biomarker costs the classical design becomes optimal for large prevalences.

## 5 Discussion

The current study suggests decision-theoretic models for optimizing confirmatory biomarker trials, both from a sponsor and a public health perspective. Furthermore, it explores the potential discrepancies between the two perspectives.

The optimized designs depend sensitively on the particular configuration of parameter values. Besides the priors on the effect sizes, the assumptions on the market size and costs have a substantial impact on the optimized designs. Therefore, formulating simple rules of thumb for trial designs is hardly feasible. However, a few general observations can be made. The optimized sample sizes for the public health utility function are consistently larger than for the sponsor utility (assuming the same costs, market size and reward parameters *r*_{F}, *r*_{S} in both utility functions). This finding is likely due to the fact that sponsor benefit is based on the estimate of the benefit in the trial, whereas the public health benefit depends on the actual benefit. Thus the public health perspective implies a higher standard for the evidence. This finding provides a quantitative basis for the qualitative observation that health authorities tend to require a higher standard of evidence than desired by some sponsors. Mechanism design theory could potentially be applied to try to create mechanisms which align the incentives more completely.

Furthermore, for very low prevalences, the classical design outperforms the designs that are based on the biomarker. However, in these scenarios the expected utility of all designs can be negative from the public health perspective, and so weakly positive from the sponsor perspective that the sponsor would rather allocate its resources elsewhere.

We find that in the sponsor view the enrichment designs never maximize the NPV in the considered scenarios. This is due to the fact that the sponsor may benefit from an authorization in the larger population even if the treatment is effective in the subpopulation only. For similar reasons, even under the global null hypothesis, the strategy of performing a trial (with the minimum sample size) has a positive expected NPV in the sponsor view (a phenomenon that has also been observed in other contexts [25]).

In the public health view the enrichment designs are optimal for a range of scenarios. In particular, if there is sufficient prior medical understanding that the biomarker negative subpopulation is unlikely to be positively affected by the drug, it can be a waste of resources to conduct the trial in this population. Ethical considerations reinforce this, as it can be argued that genuine informed consent [26] implies that patients should not be randomised if their expected utility is higher on standard of care than on randomised trial medication. On the other hand, in particular when the subpopulations can be expected to be similar in efficacy, it is not always worthwhile to conduct biomarker screening. In fact, in a stratified trial there is an increased risk that the treatment is rejected in the biomarker negative subpopulation due to chance. Still, in situations with genuine uncertainty about the relative efficacy in the two subpopulations, biomarker determination and stratified designs may have a large value. An obvious extension of our model is to allow for trial adaptations, potentially closing the biomarker negative part of the trial at an interim analysis if its results are negative [27, 28].

When applying the presented framework to practical design decisions, the different model components should be scrutinized. In the numerical example we have assumed for simplicity that the biomarker is not prognostic but in practice this will often not be the case. If the biomarker is also prognostic, the variability of the effect size estimates will be increased with a consequent decrease in the expected utility of the classical design.

As regards the market size for the sponsor, *N* denotes the number of patients, determined by the patent life, for which full payment will be received upon regulatory approval. On the other hand, for the public health authority, *N* denotes the total number of future patients. In an extended model, *N* could be fixed to always be the total number of future patients and a factor could be added next to *N* for the sponsor. This factor would then represent the fraction of patients corresponding to the patent life, and could be made to depend on the choice of trial design in various ways. For example, in the enrichment design we accounted only for the screening costs arising from the determination of the biomarker status of patients. However, if the restriction of the trial population leads to slower recruitment and consequently a later authorization of the drug, the result will be a reduction of the market size and the remaining patent life. This, in turn, may reduce the potential reward in different ways for the two perspectives. Another simplification made in our framework is the assumption of a zero discount rate for the sponsor. In practice, a commercial sponsor would use a non-zero rate to discount future revenues, which would lead to a further reduction of its expected utility as compared to a public health authority.
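The discounting point can be made concrete with a small sketch (all numbers are illustrative assumptions, not taken from the paper): the same per-patient reward stream is worth less to a sponsor applying a positive discount rate than to a public health authority effectively using a zero rate.

```python
# Hedged illustration of discounting: present value of a constant annual
# reward stream over the remaining patent life, for discount rate d.
def discounted_value(annual_reward, years, d):
    """Present value of `annual_reward` received at the end of each of `years` years."""
    return sum(annual_reward / (1 + d) ** t for t in range(1, years + 1))

print(discounted_value(100.0, 10, 0.0))   # zero discount rate: 1000.0
print(discounted_value(100.0, 10, 0.1))   # 10% discount rate: noticeably less
```

With a 10% rate the stream is worth roughly 61% of its undiscounted value, illustrating why a commercial sponsor's expected utility is further reduced relative to a public health authority.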

In the considered model we assumed that the subgroup prevalences in the trial are the same as in the total patient population. However, unless the recruitment is stratified by subgroup, the actual prevalence in the trial will be stochastic. Furthermore, the propensity to participate in the trial may vary between subpopulations. While our results are generally robust to random variations in the prevalence, varying propensities for trial participation may lead to a biased estimation of the effect size in the full population in the classical design (and also the stratified design, if an overall effect estimate based on observed prevalences is computed). The question of generalizability of trial results to general patient populations is however not specific to the development of targeted therapies but a more general issue.

We did not explicitly incorporate a benefit risk evaluation of the treatment into the model. However, the parameters *μ*_{S} and *μ*_{F} in the reward functions (see Eqs (3) and (4)) can be interpreted as the minimal treatment effects that compensate for the “costs” of the treatment, such as the burden of treatment, side effects, and monetary costs. While these are considered as given in our model, they could alternatively be estimated from clinical trial data.

We modelled the sponsor and public health utility as essentially linear functions of the observed and true effect size, respectively. From a commercial perspective this can be reasonable for scenarios where no alternative treatment options are available. However, if competitor products are on the market, the model may need to be modified because the market share, in terms of number of doses prescribed, and not only the price or benefit per patient may depend on the effect size. This can be incorporated by models where the market share is a function of the posterior distribution of efficacy (and possibly safety) parameters [29, 30]. Another aspect of our model for reimbursement concerns the pricing. Although NICE in the UK indicates that they, in our situation, would accept a price proportional to net benefit, payers in other countries may use other price models, possibly closer to a constant price. As an alternative to our linear sales model, an aggregated commercial model could be plugged in and similar optimization could be performed.

Finally, we note that the case of a very low prevalence, small market size and no biomarker costs mimics the situation of a rare disease, except that there is no complementary subgroup *S*′. Therefore, our results could be seen to suggest that the investigation of rare diseases is not recommended from either perspective. Consequently, the question arises whether research in rare diseases should receive special priority and be subsidised by society, such that drug development occurs even though the expected utility to society is negative, or in some cases weakly positive but less so than alternative expenditures. However, this argument raises ethical questions because the purely utilitarian viewpoint that underlies the decision theoretic framework does not account for other ethical principles such as fairness and justice. Similar issues arise for small subgroups of common diseases, an increasingly important issue in cancer given that it is being subdivided into many small molecular subclasses. In the case of cancer, increased benefit due to matching between molecular subgroups and targeted therapies may mollify this issue, but this remains to be seen in individual cases, so the ethical and public policy dilemma may still be present.

## Appendix

### A Computation of Expected Utilities

We derive the expected utilities for a given effect size **Δ** for the enrichment, the classical and the stratified design. The overall expected utilities are then obtained by integrating over the prior distribution.

#### Enrichment Design.

For the enrichment design, . Thus, the utility is given by
Integrating over the resulting truncated normal distribution, the expected utility given **Δ** is given by
where *ϕ* denotes the density and Φ denotes the cumulative distribution function of the standard normal distribution, and
Similarly, for the public health view utility function we obtain
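The truncated-normal integration referred to above rests on a standard identity for integrands that are linear in a standard normal variable. The following sketch (with a generic integrand and illustrative constants, not the paper's exact utility) states the identity and checks it by Monte Carlo:

```python
# For Z ~ N(0, 1):  E[(a + b*Z) * 1{Z > c}] = a*(1 - Phi(c)) + b*phi(c),
# where phi and Phi are the standard normal density and cdf.
import math
import random

def phi(x):
    """Standard normal density."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def truncated_linear_mean(a, b, c):
    """Closed form for E[(a + b*Z) * 1{Z > c}], Z standard normal."""
    return a * (1 - Phi(c)) + b * phi(c)

# Monte Carlo check of the identity (a, b, c are illustrative values)
random.seed(1)
a, b, c = 2.0, 1.5, 0.3
n = 200_000
mc = sum(a + b * z
         for z in (random.gauss(0, 1) for _ in range(n))
         if z > c) / n
print(truncated_linear_mean(a, b, c), mc)
```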

#### Classical Design.

In the classical design, , and

If the mean response in *S* and *S*′ differ, it follows that the observations in the experimental treatment and control group follow a mixture distribution of two normal distributions. Therefore, the variance of in the classical design is given by
Thus, the expected utility given effect sizes **Δ** for the classical design is given by
where *δ*_{F} = *λ*_{S} *δ*_{S} + *λ*_{S′} *δ*_{S′} and . Similarly, for the public health utility function,
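The mixture-variance step can be illustrated generically (with illustrative parameter values, not the paper's): a two-component normal mixture with weights *λ*_{S}, *λ*_{S′} and a common within-component standard deviation has variance inflated by the squared difference of the component means, which is what increases the variance seen by an unstratified analysis.

```python
# Generic sketch: a response is drawn from subgroup S with probability lam
# (mean m_s) and from S' otherwise (mean m_sp), with common sd sigma. Then
#   Var = sigma^2 + lam*(1 - lam)*(m_s - m_sp)^2.
import random
import statistics

def mixture_variance(lam, m_s, m_sp, sigma):
    """Closed-form variance of the two-component normal mixture."""
    return sigma**2 + lam * (1 - lam) * (m_s - m_sp) ** 2

# Monte Carlo check with illustrative values
random.seed(2)
lam, m_s, m_sp, sigma = 0.3, 1.0, 0.2, 1.0
sample = [random.gauss(m_s if random.random() < lam else m_sp, sigma)
          for _ in range(200_000)]
print(mixture_variance(lam, m_s, m_sp, sigma), statistics.pvariance(sample))
```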

#### Stratified Design.

The utility of the stratified design is given by
The utility of the stratified design depends on the stratified treatment effect estimate in the full population (in the following we shorten the notation by dropping the design index, ) which is a weighted sum of and . The expected utility given the effect sizes **Δ** is given by
and can be computed by numeric integration: Let *A*_{F}(*n*, *α*_{S}; *σ*, *α*, *λ*_{S}, *τ*_{S}, *τ*_{S′}, *μ*_{F}) denote the region in where and let *A*_{S}(*n*, *α*_{S}; *σ*, *α*, *λ*_{S}, *τ*_{S},*τ*_{S′}, *μ*_{S}) be the region where , where is the stratified treatment effect estimate and *Z*_{S}, *Z*_{S′} the z-statistics computed from the observations in *S* and *S*′ respectively. Then
and
The shapes of the regions *A*_{F} and *A*_{S} depend on the specific values of the parameters and the design variables (*α*_{S}, *α*_{F}, *τ*_{S}, *τ*_{S′} and *n*). However, the regions may in all cases be described by means of a finite number of straight lines, implying that the expected values above can be computed using standard software for numerical quadrature in two dimensions. But since the integrands are linear in *z*_{S′} and *Z*_{S′} follows a normal distribution, one-dimensional integration may be carried out analytically in the *z*_{S′}-direction before applying a numerical method. This leads to faster numerical evaluations, which is useful when investigating how the optimal solution changes over the parameter space.
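The semi-analytic scheme described above (analytic inner integral in the *z*_{S′}-direction, numerical outer integral) can be sketched with an illustrative integrand that is linear in *z*′ and a straight-line region boundary; all concrete numbers below are assumptions made for the demonstration:

```python
# Sketch: for independent Z, Z' ~ N(0, 1), compute E[(a + b*Z') * 1{Z' > c(Z)}].
# Since the integrand is linear in z', the inner integral over z' is analytic;
# only the outer integral over z is done numerically (trapezoidal rule here).
import math
import random

def phi(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

a, b = 1.0, 0.5               # illustrative integrand coefficients
def c(z):
    return 0.2 - 0.4 * z      # illustrative straight-line region boundary

def inner(z):
    """Analytic inner integral over z' > c(z) of (a + b*z') * phi(z')."""
    return a * (1 - Phi(c(z))) + b * phi(c(z))

# Outer integral over z, weighted by phi(z), via the trapezoidal rule
lo, hi, m = -8.0, 8.0, 4000
h = (hi - lo) / m
vals = [inner(lo + i * h) * phi(lo + i * h) for i in range(m + 1)]
semi_analytic = h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

# Monte Carlo cross-check of the two-dimensional expectation
random.seed(3)
n = 200_000
mc = sum(a + b * zp
         for z, zp in ((random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n))
         if zp > c(z)) / n
print(semi_analytic, mc)
```

Reducing the problem to a one-dimensional numerical integral in this way is what makes repeated evaluation over the parameter space cheap.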

For the public health view the expected utility given **Δ** may be written as
The numerical evaluation is similar to the evaluation of the conditional expectation of the utility of the stratified design for the sponsor’s view.

## Author Contributions

**Conceived and designed the experiments:** RAB CFB SJ FK TO MP NS. **Performed the experiments:** SJ TO. **Analyzed the data:** SJ TO. **Wrote the paper:** RAB CFB SJ FK TO MP NS.

## References

- 1. Mandrekar SJ, Sargent DJ. Clinical Trial Designs for Predictive Biomarker Validation: One Size Does Not Fit All. Journal of Biopharmaceutical Statistics. 2009;19(3):530–542. pmid:19384694
- 2. Chen C, Beckman RA. Hypothesis Testing in a Confirmatory Phase III Trial With a Possible Subset Effect. Statistics in Biopharmaceutical Research. 2009;1:431–440.
- 3. Freidlin B, McShane LM, Korn EL. Randomized Clinical Trials With Biomarkers: Design Issues. JNCI Journal of the National Cancer Institute. 2010;102(3):152–160. pmid:20075367
- 4. Mandrekar SJ, Sargent DJ. Design of clinical trials for biomarker research in oncology. Clinical Investigation. 2011;1(12):1627–1636.
- 5. Freidlin B, McShane LM, Polley MYC, Korn EL. Randomized Phase II Trial Designs With Biomarkers. Journal of Clinical Oncology. 2012;30(26):3304–3309. pmid:22869885
- 6. Ziegler A, Koch A, Krockenberger K, Grosshennig A. Personalized medicine using DNA biomarkers: a review. Human Genetics. 2012;131(10):1627–1638. pmid:22752797
- 7. Ondra T, Dmitrienko A, Friede T, Graf A, Miller F, Stallard N, et al. Methods for identification and confirmation of targeted subgroups in clinical trials: a systematic review. Journal of Biopharmaceutical Statistics. 2016;26(1):99–119. pmid:26378339
- 8. Song Y, Chi GYH. A method for testing a prespecified subgroup in clinical trials. Statistics in Medicine. 2007;26(19):3535–3549. pmid:17266164
- 9. Alosh M, Huque MF. A flexible strategy for testing subgroups and overall population. Statistics in Medicine. 2009;28(1):3–23. pmid:18985704
- 10. Bretz F, Maurer W, Brannath W, Posch M. A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine. 2009;28(4):586–604. pmid:19051220
- 11. Burman CF, Sonesson C, Guilbaud O. A recycling framework for the construction of Bonferroni-based multiple tests. Statistics in Medicine. 2009;28(4):739–761. pmid:19142850
- 12. Zhao YD, Dmitrienko A, Tamura R. Design and Analysis Considerations in Clinical Trials With a Sensitive Subpopulation. Statistics in Biopharmaceutical Research. 2010;2(1):72–83.
- 13. Spiessens B, Debois M. Adjusted significance levels for subgroup analyses in clinical trials. Contemporary Clinical Trials. 2010;31(6):647–656. pmid:20832503
- 14. Bretz F, Posch M, Glimm E, Klinglmueller F, Maurer W, Rohmeyer K. Graphical approaches for multiple comparison procedures using weighted Bonferroni, Simes, or parametric tests. Biometrical Journal. 2011;53(6):894–913. pmid:21837623
- 15. Millen BA, Dmitrienko A. Chain procedures: A class of flexible closed testing procedures with clinical trial applications. Statistics in Biopharmaceutical Research. 2011;3(1):14–30.
- 16. Alosh M, Huque MF. Multiplicity considerations for subgroup analysis subject to consistency constraint. Biometrical Journal. 2013;55(3):444–462. pmid:23585158
- 17. Beckman RA, Clark J, Chen C. Integrating predictive biomarkers and classifiers into oncology clinical development programmes. Nature Reviews Drug Discovery. 2011;10(10):735–748. pmid:21959287
- 18. Krisam J, Kieser M. Decision Rules for Subgroup Selection Based on a Predictive Biomarker. Journal of Biopharmaceutical Statistics. 2014;24(1):188–202. pmid:24392985
- 19. Götte H, Donica M, Mordenti G. Improving probabilities of correct interim decision in population enrichment designs. Journal of Biopharmaceutical Statistics. 2015;25(5):1020–38. pmid:24914474
- 20. Kirchner M, Kieser M, Götte H, Schüler A. Utility-based optimization of phase II/III programs. Statistics in Medicine. 2016;35(2). pmid:26256550
- 21. Krisam J, Kieser M. Optimal Decision Rules for Biomarker-Based Subgroup Selection for a Targeted Therapy in Oncology. Int J Mol Sci. 2015;16(5):10354–75. pmid:25961947
- 22. Graf AC, Posch M, Koenig F. Adaptive designs for subpopulation analysis optimizing utility functions. Biometrical Journal. 2015;57:76–89. pmid:25399844
- 23. R Core Team. R: A Language and Environment for Statistical Computing; 2016. Available from: https://www.R-project.org/.
- 24. O’Hagan A, Stevens JW, Campbell MJ. Assurance in clinical trial design. Pharmaceutical Statistics. 2005;4:187–201.
- 25. Posch M, Bauer P. Adaptive budgets in clinical trials. Statistics in Biopharmaceutical Research. 2013;5:282–292.
- 26. Burman CF, Carlberg A. Future Challenges in Design and Ethics of Clinical Trials. In: Pharmaceutical Sciences Encyclopedia. vol. 51; 2010. p. 1–28. https://doi.org/10.1002/9780470571224.pse250
- 27. Brannath W, Zuber E, Branson M, Bretz F, Gallo P, Posch M, et al. Confirmatory adaptive designs with Bayesian decision tools for a targeted therapy in oncology. Statistics in Medicine. 2009;28:1445–1463. pmid:19266565
- 28. Bauer P, Bretz F, Dragalin V, König F, Wassmer G. Twenty-five years of confirmatory adaptive designs: opportunities and pitfalls. Statistics in Medicine. 2016;35:325–347. pmid:25778935
- 29. Gittins J, Pezeshk H. A behavioral Bayes method for determining the size of a clinical trial. Drug Information Journal. 2000;34:355–363.
- 30. Kikuchi T, Pezeshk H, Gittins J. A Bayesian cost-benefit approach to the determination of sample size in clinical trials. Statistics in Medicine. 2008;27(1). pmid:17566967