Designing an exploratory phase 2b platform trial in NASH with correlated, co-primary binary endpoints

Non-alcoholic steatohepatitis (NASH) is the progressive form of non-alcoholic fatty liver disease (NAFLD) and a disease with high unmet medical need. Platform trials provide great benefits for sponsors and trial participants in terms of accelerating drug development programs. In this article, we describe some of the activities of the EU-PEARL consortium (EU Patient-cEntric clinicAl tRial pLatforms) regarding the use of platform trials in NASH, in particular the proposed trial design, decision rules and simulation results. For a set of assumptions, we present the results of a simulation study that was recently discussed with two health authorities, along with the learnings from these meetings from a trial design perspective. Since the proposed design uses co-primary binary endpoints, we furthermore discuss the different options and practical considerations for simulating correlated binary endpoints.


Introduction
The recent years have seen unprecedented challenges for many branches of modern medical research. The desire to accelerate development and approval of new treatments has called into question some long-standing drug development paradigms, such as the strict succession of phase 1, 2 and 3 trials and the insistence on separate trials for every experimental compound [1]. Consequently, substantial effort has been put into the development of master protocol trials and in particular platform trials [2][3][4][5]. These types of trials allow evaluation of many investigational treatments in parallel, and hence their implementation has increased over the last years. The interest in platform trials has increased further with the emergence of the global pandemic due to the SARS-CoV-2 virus [6][7][8][9][10][11]. However, many operational, logistical and statistical challenges around platform trials remain.
The definition of platform trials used in this article is that they are clinical trials which investigate multiple treatments or treatment combinations in the context of a single disease, possibly within several sub-studies for different disease sub-types or targeting different trial participant populations. In a platform trial, both drugs or drug combinations within existing sub-studies, as well as new sub-studies, may enter or leave the trial over time, allowing the trial, in principle, to run indefinitely. Within each sub-study, many adaptive and innovative design elements may be combined, which clearly separates platform trials from more classical trial designs [4]. For a more detailed introduction, we refer to Meyer et al. [2], where we conducted a comprehensive systematic search to review the current literature on master protocol trials from a design and analysis perspective. A compact glossary of common terms related to platform trials can be found in Table 1, while a more detailed list of terms and explanations can be found online [12].
Platform trials can leverage their main strengths, such as adaptive design elements, testing multiple hypotheses in a single trial framework, reduced time to make decisions, ease of incorporating new investigational treatments into the ongoing trial, and possibilities for collaboration between different consortia/sponsors. In 2018, the Innovative Medicines Initiative (IMI) put forth a call for proposals for the development of integrated research platforms to conduct platform trials and thereby enable more patient-centric drug development. A consortium of 36 private and public partners has come together in a strategic partnership to deliver on the IMI proposal goals; the project is called EU Patient-cEntric clinicAl tRial pLatforms (EU-PEARL) [13]. Among the expected outputs of the initiative are publicly available master protocol templates for platform trials and four disease-specific master protocols for platform trials ready to operate in disease areas still facing high unmet clinical need, one of those diseases being non-alcoholic steatohepatitis (NASH).
NASH is a more progressive form of non-alcoholic fatty liver disease (NAFLD) and is estimated to affect approximately 5% of the world population. The disease is characterized by the accumulation of fat in the liver in the absence of significant alcohol intake or other secondary causes of hepatic steatosis [14,15]. Over time, chronic inflammation and liver cell injury lead to fibrosis and eventually cirrhosis, including complications of end-stage liver disease and hepatocellular carcinoma. Indeed, NASH complications are rapidly becoming the leading indication for liver transplantation. In addition, NASH is associated with higher risks of developing cardiovascular diseases, which are the primary cause of death for most people affected. Currently, there are no approved treatments for NASH in the US and EU, and in recent years several compounds failed to meet their phase 3 primary endpoint(s) [16,17]. However, developing treatments for NASH is a very active area of clinical research, with dozens of industry-sponsored interventional studies active or recruiting trial participants across phases 1 through 3, the vast majority in phase 1 or 2, according to ClinicalTrials.gov and the EU clinical trials register (https://www.clinicaltrialsregister.eu).
To facilitate and accelerate the identification of the most effective and promising novel treatment options for trial participants with NASH, multiple potential novel therapies, as well as combinations of novel mechanisms of action, will need to be evaluated in well-designed early clinical studies before advancing to pivotal phase 3 programs. From a platform study perspective, phase 2b is often the preferred setting, as it generally offers a robust pipeline for most indications and the ability to make decisions more rapidly before committing to longer, more costly development. This is particularly true for NASH, where there is an abundance of compounds in early development and phase 3 programs tend to run over several years. Importantly, there are broadly common design elements, study populations, procedures, and endpoints for NASH phase 2b clinical studies which are aligned with Health Authority (HA) guidance.
Both the United States Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have put forward advice for developing drugs for patients with non-cirrhotic NASH [18][19][20]. Both HAs note that the risk of progression to clinical outcomes (i.e., both liver-related and non-liver-related morbidity and mortality) is mainly related to fibrosis stage. Therefore, the non-cirrhotic NASH population that should be studied comprises individuals with either fibrosis stage 2 (F2) or stage 3 (F3), since they are at increased risk of progression relative to those with little (F1) or no (F0) liver fibrosis [21][22][23]. In addition, the recognition by the HAs that the length of time necessary to observe a sufficient number of clinical events to assess drug efficacy may hamper drug development has led them to recommend improvements in liver histology as clinical trial endpoints (i.e., resolution of steatohepatitis with no worsening of liver fibrosis, and improvement in liver fibrosis of at least one stage with no worsening of steatohepatitis), which can be used as surrogates for approval in phase 3 under the accelerated approval pathways. The FDA guidance therefore advises that phase 2b studies demonstrate efficacy on a histological endpoint after at least 12-18 months of treatment (given that histological change takes an extended period of time to occur), using a range of doses to support phase 3 dose selection. Members of EU-PEARL are currently developing a master protocol to support a phase 2b platform trial in NASH, and this paper, as well as a previously published simulation study [24], describes the initiative's efforts to simulate the performance of the parameters used to make decisions on whether or not the treatment being evaluated is effective.

Table 1: Glossary of important terms related to platform trials, taken partly from ICH E9 [25] and partly from EU-PEARL D2.1 [12].

Term Description
Adaptive Design
An adaptive design allows the pre-specification of flexible components for the major aspects of the trial, such as the treatment arms used (dose, frequency, duration, combinations, etc.), the allocation to the different treatment arms, the eligible patient population, and the sample size. An adaptive design can learn from the accruing data what the most therapeutic doses or arms are, allowing, for example, the design to home in on the best arms.

Integrated Research Platform
An Integrated Research Platform (IRP) is a novel clinical development concept centered on a master trial protocol which can accommodate multi-sourced interventions using the existing infrastructure of hospitals and federated patient data in design, planning and execution, while an optimized regulatory pathway for these novel treatments has been assured.

Master Protocol
The term "master protocol" refers to a single overarching design developed to evaluate multiple hypotheses, with the general goals of improving efficiency and establishing uniformity through standardization of procedures in the development and evaluation of different interventions. Under a common infrastructure, the master protocol may be differentiated into multiple parallel sub-studies to include standardized trial operational structures, patient recruitment and selection, data collection, analysis, and management. In a platform trial, the protocol will have the infrastructure to drop interventions and to allow new interventions or combinations of interventions to enter the study based on decision rules in the master protocol.

Platform Trials
Clinical trials which investigate multiple treatments or treatment combinations in the context of a single disease, possibly within several sub-studies for different disease sub-types or targeting different trial participant populations.For more information, see section 1.

Multi-center Trial
A clinical trial conducted according to a single protocol but at more than one site, and therefore, carried out by more than one investigator.

Frequentist Methods
Statistical methods, such as significance tests and confidence intervals, which can be interpreted in terms of the frequency of certain outcomes occurring in hypothetical repeated realisations of the same experimental situation.

Bayesian Methods
Approaches to data analysis that provide a posterior probability distribution for some parameter (e.g. treatment effect), derived from the observed data and a prior probability distribution for the parameter. The posterior distribution is then used as the basis for statistical inference.

Interim Analysis
Any analysis intended to compare treatment arms with respect to efficacy or safety at any time prior to the formal completion of a trial.

Platform Design
An overview of the proposed platform trial design can be found in Figure 1. Generally, it is assumed that after an initial inclusion of a certain number of cohorts, each consisting of a treatment and a matching control, further cohorts will enter over time, while some of the existing cohorts might be discontinued for efficacy or futility. Trial participants entering the platform will be allocated between open cohorts. Within open cohorts, trial participants will be equally allocated between the control and treatment arms using a block randomization of length two. Finally, the platform ends when all cohorts have finished their analyses. If the inclusion and exclusion criteria of the different cohorts are similar, it might be preferable to share the accumulating information on the control treatments, at least for concurrently enrolling trial participants. While there is a lot of controversy regarding the use of non-concurrent controls [26], sharing only information on trial participants that could have been randomized to the arm under investigation seems uncontroversial (note that this requires data to be concurrent). As noted before, platform trials can run perpetually without limiting the number of drugs going into the trial. Any potentially successful compound in a NASH phase 2b trial would have to show resolution of NASH without worsening of fibrosis (binary endpoint 1) and/or a 1-stage fibrosis improvement without worsening of NASH (binary endpoint 2).
Endpoints 1 and 2 are correlated binary endpoints, and clinical studies have demonstrated a strong link between histologic resolution of steatohepatitis and improvement in fibrosis [27,28]. Improvement in endpoint 1 could therefore lead to improvement in endpoint 2, but not necessarily the converse, and not necessarily during the same time frame. The current regulatory guidance is that the FDA recommends demonstrating endpoint 1 OR endpoint 2, while the EMA recommends demonstrating endpoint 1 AND endpoint 2 [21][22][23]. For this simulation study, we decided to follow the FDA endpoint recommendations. Within EU-PEARL, several possible phase 2b platform trial designs for NASH were considered: treatment (one dose; could be monotherapy or combination therapy) versus control, treatment (multiple doses; could be monotherapy or combination therapy) versus control, combination therapy versus monotherapies versus control, etc. Furthermore, it was considered whether the final endpoints (which are observed after roughly 48-52 weeks) should be used for interim decision making or whether a short-term surrogate endpoint should be used. Based on the proposed design, comprehensive simulations were run for two scenarios: monotherapy (one dose) versus control, and combination therapy versus monotherapies versus control. We present results of the former in this paper; results of the latter can be found in Meyer et al. [24,29].

Decision Rules
Decisions on whether or not to promote treatments to the next stage of development can be based on different principles, such as fixed thresholds for treatment effect estimates, the p-values of frequentist statistical tests for treatment efficacy, or conditional or predictive probabilities of final trial success. Many readers might be familiar with group-sequential trials, where early stopping for futility or efficacy is based on the p-values from statistical tests which are adjusted for repeated looks into the data, such as the O'Brien-Fleming test [30,31].
In some simple situations (e.g., if stopping the entire clinical trial for efficacy or futility is the only permitted interim decision option), it is possible to convert such decision rules into each other [32] (in the sense that a decision rule given by a threshold on conditional power can equivalently be stated by a correspondingly recalculated threshold on the estimated treatment effect, say). In platform trials, however, the decision space is usually more complicated and comprises interdependent decisions, such as stopping arms without stopping the entire trial or selecting treatments if they are sufficiently superior to other treatments. In such situations, there is no simple one-to-one correspondence between decision rules formulated on different scales (e.g., a decision rule which is influenced by the treatment effect estimates from several treatments cannot be converted into a fixed threshold for one specific treatment). It is also very difficult to provide decision rules on un-standardized measures such as treatment effect estimates, since these would have to be derived anew for every concrete application. For these reasons, we focus on Bayesian posterior probabilities [5] as the main vehicle for making decisions in this paper. The benefit of using Bayesian decision rules is their flexibility regarding extensions to several criteria and interim analyses. To illustrate the basic mechanics, we introduce the concept for comparing the response rate of a new treatment (π_E) with the response rate of the Standard-of-Care (SoC) (π_S) in a clinical trial. For an analysis after observing data D, we conduct a Bayesian analysis with the aim of deciding whether there is enough evidence to declare the treatment effective. First, we introduce the concept for a Bayesian decision rule testing a single endpoint, using a parsimonious notation for illustrative purposes. Later, and in the appendix, we show how the parameters could be specified for the endpoints at hand in NASH. Different levels of evidence will be introduced depending on the parameterization of the Bayesian decision rule. Typically, a Bayesian decision rule of the following sort could be used for comparing the new treatment to the control SoC (the priors on π_S and π_E are omitted for better readability):

Declare Efficacy, if P(π_E > π_S + δ | D) > γ,    (1)

with some pre-specified probability threshold γ and pre-defined margin δ for the targeted treatment effect of interest. Such a decision rule based on a posterior distribution can, but does not have to, correspond to a particular null hypothesis (e.g., H_0: π_E ≤ π_S + δ). For example, it is sometimes appropriate to use so-called "shrinkage estimators", where the single treatment effect estimates in a platform trial are "shrunk" towards a common average effect. This is appropriate if drugs share a common mechanism of action and it is therefore a priori plausible that they may have similar effects. In such situations, the decision on a single drug is influenced by the performance of the entire class of drugs. For better readability, the dependence on the data D is omitted in the following sections.

Figure 1: Overview of the proposed platform design. After an initial inclusion of two cohorts consisting of a control arm (usually the standard-of-care, "SoC") and a "regimen" arm (where "regimen" could be, e.g., a monotherapy or a combination therapy), more cohorts of the same structure enter the trial over time. Within each cohort, several interim analyses and a final analysis are conducted using the co-primary binary endpoints "NASH resolution without worsening of fibrosis" and "Fibrosis improvement without worsening of NASH". The platform trial ends when all cohorts have been evaluated.
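The single-endpoint posterior-probability rule above can be sketched with a short Monte Carlo computation. The following is a minimal illustration, not the implementation used in the trial simulations: it assumes independent Beta(1, 1) priors on the two response rates, and the interim counts (40/75 responders on treatment, 15/75 on SoC) are entirely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def declare_efficacy(x_e, n_e, x_s, n_s, delta, gamma, n_draws=100_000):
    """Estimate P(pi_E > pi_S + delta | D) by Monte Carlo under
    independent Beta(1, 1) priors and compare it to the threshold gamma."""
    # Conjugate Beta posteriors for the treatment and SoC response rates
    pi_e = rng.beta(1 + x_e, 1 + n_e - x_e, n_draws)
    pi_s = rng.beta(1 + x_s, 1 + n_s - x_s, n_draws)
    post_prob = float(np.mean(pi_e > pi_s + delta))
    return post_prob, post_prob > gamma

# Hypothetical interim data: 40/75 responders on treatment, 15/75 on SoC
prob, graduate = declare_efficacy(x_e=40, n_e=75, x_s=15, n_s=75,
                                  delta=0.0, gamma=0.95)
```

Because the Beta prior is conjugate to the binomial likelihood, the posterior draws come directly from Beta distributions; for more elaborate rules (e.g. shrinkage estimators across arms), this sampling step would be replaced by draws from the corresponding joint posterior.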
Decision rules should provide a high level of confidence that graduating compounds are competitive with respect to the current landscape of compounds in development with publicly accessible phase 2/3 study results. In particular, semaglutide demonstrated a 42 percentage point increase in the response rate for NASH resolution (endpoint 1) as compared to placebo [33], and lanifibranor demonstrated a 19 percentage point increase in the response rate for fibrosis improvement (endpoint 2) as compared to placebo [34]. In the process of eliciting which exact decision rules to use, we first conducted a review of studies in NASH. Several structured discussions were held between statisticians and clinical experts to define endpoints and targeted effect sizes. Finally, the experts provided confidence intervals based on which they would accept graduation of compounds from the platform trial. Confidence intervals are generally centered around the observed response rate, and the width of the confidence interval describes the remaining uncertainty, which is directly linked to the sample size, i.e., larger sample sizes lead to narrower confidence intervals. The width of the confidence intervals the experts were presented with corresponds to a sample size of 75, which is the lowest treatment group size investigated in this simulation study and corresponds to a treatment group size usually used in NASH phase 2b trials. In frequentist decision making, one might tailor the decision rules such that the confidence interval does not include a certain lower bound of efficacy. The multi-component Bayesian decision rules we propose allow for a refined specification of the evidence on the efficacy of a new compound required in order to graduate, while at the same time controlling basic type 1 error with one of the decision rule components. It should also be noted that our efficacy decision rules are based on checking the criterion that there is sufficient confidence that the effect size (i.e., the difference in response rate between the experimental and control treatment) exceeds (a) certain margin(s) with certain confidence(s), i.e., the posterior probabilities for decision rules as defined in equation 1. If the margin is selected close to the true (but unknown) effect size, there are limits to the achievable confidence, which in some situations may seem counter-intuitive. As an example, if the true success rate is 0.5 and a weakly informative prior is assumed, we will never be able to achieve a confidence greater than 50% that the true success rate is 0.5 or larger (for large sample sizes). The reason is that the posterior distribution will be centered around the true success rate of 0.5, resulting in a probability of at most 50% that the value is equal to or larger than 0.5. This is equivalent to achieving 50% power to detect a success rate of 0.5 if we require a confidence greater than or equal to 50%. If we indeed wanted to detect a success rate of 50% with larger power, we would need to reduce either the required confidence or the targeted success rate in our Bayesian decision rules. This is illustrated further in Figure 8 in appendix A.1.3, where the resulting posterior distribution for a theoretical success rate of 0.5 is shown when a weakly informative Beta(1,1) prior and a sample size of 75 participants per group are chosen and the observed success rate equals the assumed response rate. Finally, communication of the chosen decision rules to clinicians and general audiences is not always straightforward: e.g., while graduating a compound if there is 20% confidence that the effect is sufficiently large (say ∆) is equivalent to dropping the compound if there is more than 80% confidence that the effect is smaller than ∆, the latter is much more generally understood. For lack of a comparable frequentist design, no direct comparison of operating characteristics was conducted.
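The confidence limit described above is easy to reproduce numerically. Under the stated assumptions (a Beta(1,1) prior and 75 participants), and with a hypothetical observed count of 38 responders (a rate of roughly 0.5), the posterior probability that the true rate is at least 0.5 hovers near 50%:

```python
from scipy.stats import beta

# Hypothetical data: 38 of 75 responders (rate ~ 0.5) with a Beta(1, 1) prior
n, x = 75, 38
posterior = beta(1 + x, 1 + n - x)

# Posterior confidence that the true rate is at least 0.5: close to 50%,
# well below thresholds such as 0.85 or 0.95
conf = posterior.sf(0.5)
```

No matter how large the sample becomes, data generated at a true rate of 0.5 cannot push this confidence meaningfully above 50%, which is why the margin must sit below the targeted effect size.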
We propose a Bayesian framework for a multi-level efficacy decision rule which incorporates different levels of evidence, ranging from information on whether the treatment is simply superior to the control up to information on how likely larger effect sizes of interest are. A specification of such multi-level efficacy decision rules using three levels of evidence can be found for both endpoints of interest in NASH in Table 2 and Figure 2. At level 1, the main target is to show whether the experimental treatment is superior to the control by setting the margin δ_1 = 0. To ensure sufficient type 1 error control (assuming weakly informative priors on the success rates), the required confidence is set to γ_1 = 0.95. At level 2, sufficient evidence should be provided that the true effect size is larger than moderate differences, with a certain level of confidence. For example, we set δ_2 for each endpoint to an effect size which, as elicited, should be close to the lower end of any 95% confidence interval based on which a treatment would be graduated, and the required confidence γ_2 to see an effect size at least as large as this to 85% (in accordance with our considerations regarding the confidence interval). Level 3 requires that there is also sufficient confidence in observing larger effects. For example, we set δ_3 for each endpoint to an effect size which, as elicited, should be slightly below the center of any 95% confidence interval based on which a treatment would be graduated, and the required confidence γ_3 to see an effect size at least as large as this to 60% (again in accordance with our considerations regarding the confidence interval). Of course, any such ordering of δs and γs should fulfill the conditions δ_1 < δ_2 < δ_3 and γ_1 > γ_2 > γ_3 to be meaningful. To motivate the multi-level decision rules, we simulated all scenarios using the different levels of required evidence. Please note that while δ_1 < δ_2 < δ_3 and γ_1 > γ_2 > γ_3, this does not mean that level 3 decision rules are generally "stricter" than level 2 or level 1 decision rules. In fact, if the posterior were extremely flat, the level 1 requirement would be the strongest.

Table 2: Levels of efficacy evidence required.
Level 1: First level of efficacy evidence, serving as a threshold to establish sufficient confidence in any treatment effect larger than 0, i.e., superiority of the treatment to control. Note that when using non-informative priors, the one-sided frequentist type I error will be about 1 − γ for a single endpoint, i.e., 5% in this example.
Level 2: Second level of efficacy evidence, serving as a threshold to establish high confidence that the true effect is larger than moderate treatment effects. In this example, a higher margin is required for the first endpoint than for the second endpoint.
Level 3: Third level of efficacy evidence, serving as a threshold to establish sufficient confidence in large treatment effects.
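As a sketch of how such a multi-level rule could be evaluated on posterior samples, the snippet below checks all three (δ, γ) pairs jointly, using the endpoint 2 margins quoted in this paper (δ = 0, 0.175, 0.25 with γ = 0.95, 0.85, 0.60); the response counts and the Beta(1,1) priors are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

def multilevel_efficacy(draws_e, draws_s, rules):
    """Multi-level Bayesian efficacy check: every (delta, gamma) pair must
    satisfy P(pi_E - pi_S > delta | D) > gamma."""
    diff = draws_e - draws_s
    return all(np.mean(diff > d) > g for d, g in rules)

# (delta, gamma) per evidence level; endpoint 2 values from the text
rules_ep2 = [(0.0, 0.95), (0.175, 0.85), (0.25, 0.60)]

# Posterior draws under Beta(1, 1) priors for hypothetical data:
# 45/75 responders on treatment, 15/75 on control
pi_e = rng.beta(1 + 45, 1 + 30, 200_000)
pi_s = rng.beta(1 + 15, 1 + 60, 200_000)
ok = multilevel_efficacy(pi_e, pi_s, rules_ep2)
```

Note that a very diffuse posterior can pass levels 2 and 3 (with their lower γs) while failing level 1, which is why all levels are checked rather than only the one with the largest margin.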
In addition to graduating a treatment based on (possibly multi-level) Bayesian decision rules, one might also be interested in dropping a treatment at an interim analysis for futility based on posterior probabilities. In this case, the Bayesian decision rule for a single endpoint can be expanded by introducing margins δ_TE and δ_TF with thresholds γ_TE and γ_TF for graduating and dropping, respectively. At a given point in time T, after observing data D, we conduct an analysis with the aim of deciding whether we have enough evidence to declare the treatment efficacious or futile:

Declare Efficacy, if P(π_E > π_S + δ_TE | D) > γ_TE;
Declare Futility, if P(π_E > π_S + δ_TF | D) < γ_TF,

with pre-specified probability thresholds γ_TE and γ_TF and required treatment effects δ_TE and δ_TF. Since we decided to follow the FDA requirements, a treatment will be graduated if the efficacy decision rules are met for at least one of the two endpoints (referred to as the "OR" decision rule), while it will only be dropped at interim for futility if the futility decision rules are met for both endpoints. The specification of the efficacy decision rules for both endpoints is given in Table 2, whereby the same margins δ and confidences γ are used for the interim and final analyses. The trial will be stopped for futility if there is only a small likelihood that the response rate of the treatment arm exceeds that of the control arm by at least 25 and 10 percentage points for endpoint 1 and endpoint 2, respectively. For endpoint 1, both efficacy and futility decision rules are illustrated in Figure 2. For more information, including a verbal description of the decision rules and a more formal definition, please refer to appendix A.1.
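The combination of the "OR" efficacy rule and the "AND" futility rule can be expressed compactly. In the hypothetical sketch below, `post_probs_eff[k]` stands for the posterior probability of exceeding the efficacy margin on endpoint k+1 and `post_probs_fut[k]` for the probability of exceeding the futility margin; all numeric values are made up for illustration:

```python
def trial_decision(post_probs_eff, post_probs_fut, gamma_e, gamma_f):
    """Combine per-endpoint posterior probabilities:
    - graduate if the efficacy criterion holds on at least ONE endpoint ("OR"),
    - drop for futility only if the futility criterion holds on BOTH endpoints,
    - otherwise continue to the next analysis."""
    if any(p > gamma_e for p in post_probs_eff):
        return "graduate"
    if all(p < gamma_f for p in post_probs_fut):
        return "drop"
    return "continue"

# Hypothetical posterior probabilities for endpoints 1 and 2:
# endpoint 1 clears the efficacy threshold, so the treatment graduates
decision = trial_decision(post_probs_eff=[0.97, 0.40],
                          post_probs_fut=[0.55, 0.70],
                          gamma_e=0.95, gamma_f=0.20)
```

The asymmetry is deliberate: a single convincing endpoint suffices to graduate, whereas futility on only one endpoint keeps the cohort running.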

Simulation Setup
For classical randomized controlled trials (RCTs), or even some multi-arm, multi-stage trials, the required sample size to achieve a certain power might be a deterministic function of several assumptions and design parameters, such as treatment effects, significance level and the chosen test procedure. Due to their complex designs, platform trials usually require simulations in order to calculate operating characteristics such as power and average trial duration. In order to simulate the EU-PEARL NASH phase 2b platform trial, we chose a set of assumptions and design choices that were fixed and a set of assumptions and design choices that were varied (see Table 3). In general, for every trial participant we observe two correlated binary outcomes, NASH resolution (endpoint 1) and fibrosis improvement (endpoint 2). Outcomes are simulated using the approach described in section A.2.4, such that in the simulations we can fix the success rates of endpoint 1 and endpoint 2 and their latent variable correlation ρ (see Figure 3). The correlation between the two endpoints can be interpreted in such a way that if there is a positive correlation, then there is an increased likelihood that events are observed in both endpoints jointly or not at all. If there is a negative correlation, then if an event is observed in one endpoint, there is a larger likelihood that no event is observed for the other one. If the two endpoints are uncorrelated (ρ = 0), knowing whether an event was observed for one endpoint gives no information as to whether or not an event is observed for the second one. Sample sizes reflect the number of trial participants with complete observations (i.e. paired biopsies) by the time of the final analysis. After this number of trial participants is enrolled in a given treatment arm, enrollment to this treatment arm stops. Simulations of this trial design were performed using the cats package, which is available on GitHub (https://github.com/el-meyer/cats) and CRAN (https://cloud.r-project.org/web/packages/cats/index.html). For each of the distinct combinations of simulation parameters, the platform trial was simulated 10000 times. Results of those 10000 simulated platform trial trajectories were summarized for each of the sets of simulation parameters and visualized using lattice plots [35]. In particular, we present the success probability, i.e. the probability for a particular drug to be declared superior to control; this is equivalent to the type 1 error when the drug is in truth futile and to the power when the drug is in truth efficacious.
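The latent-variable approach referenced above (section A.2.4, implemented in the cats package) can be sketched as a Gaussian copula: draw from a bivariate normal with correlation ρ and threshold each coordinate so that the desired marginal response rates are obtained. The standalone Python sketch below is an illustrative re-implementation under these assumptions, not the cats code itself:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2024)

def correlated_binary(n, p1, p2, rho):
    """Draw n pairs of correlated binary outcomes via a latent bivariate
    normal with correlation rho: outcome k is 1 when its latent normal
    variable exceeds the threshold Phi^{-1}(1 - p_k)."""
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    thresholds = norm.ppf([1.0 - p1, 1.0 - p2])
    return (z > thresholds).astype(int)

# Marginal response rates as in Figure 3 (30% and 40%), latent rho = 0.7
y = correlated_binary(100_000, p1=0.30, p2=0.40, rho=0.7)
```

The marginal response rates are unaffected by ρ, while the probability of responding on both endpoints jointly grows with ρ; with ρ = 0.7 it clearly exceeds the 0.30 × 0.40 = 0.12 expected under independence, so the probability of responding on at least one endpoint shrinks accordingly, mirroring Figure 3.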

Figure 2: Schematic overview of the decision rules used. On the x-axis, the difference in response rates between the control and treatment groups (i.e. the treatment effect) in percentage points is shown. At the two interim analyses, cohorts can be stopped early for futility if there is very little evidence (interim analysis 1: less than 20%; interim analysis 2: less than 30%) that the treatment is better than control by 25 percentage points or more (red box). At all analysis time points, the same efficacy decision rules are used (blue boxes). Depending on the aim of the study, all or only certain levels of evidence could be required (see also Table 2). The treatment effects (δs) presented in this figure correspond to the decision rules used for endpoint 1; for endpoint 2, we used δ_1 = 0, δ_2 = 0.175 and δ_3 = 0.25, as well as a futility margin of 10 percentage points.
Figure 3: Impact of different levels of correlation between endpoints 1 and 2 on the expected number of responders. Assuming a response rate of 30% for endpoint 1 and 40% for endpoint 2, we expect 30/100 patients to reach endpoint 1 (in red) and 40/100 patients to reach endpoint 2 (in yellow), regardless of the correlation. Depending on the level of correlation, the number of responders that reach both endpoints simultaneously (in orange) varies; it increases with increasing correlation. In contrast, the expected number of trial participants that reach at least one of the two endpoints decreases with increasing correlation (in this example, this number is 69 when the correlation is -0.3 and 53 when the correlation is 0.7).

Main Simulation Results
Trial success probabilities with respect to the chosen simulation parameters are shown in Figure 4. It becomes apparent that when the drug under investigation is not effective for either of the endpoints (i.e. responder rates of 10% for endpoint 1 and 25% for endpoint 2), the success probability (in this case the type 1 error) is negligible (about 0.1%) regardless of the sample size. On the other hand, when the drug is highly efficacious for either or both of the endpoints (i.e. a responder rate of 55% for both endpoints), the success probability (in this case the power) is close to 1. When the treatment under investigation exhibits a responder rate of 35% for one or both endpoints, the success probability is below 20%, and in case of a larger sample size (125 per arm) below 10%. When the treatment under investigation is promisingly efficacious on both endpoints (i.e. a responder rate of 45%), the success probability is between 60-70%, depending on the sample size and correlation. In terms of data sharing, we observe the same pattern as with increased sample size: when the treatment is highly efficacious, the success probability increases, otherwise it decreases. This is a feature of the Bayesian decision rules (in frequentist analyses we would expect the type 1 error to be the same).
As an example, when the sample size is 125 per arm, there is no correlation between the two endpoints and the responder rate is 35% for both endpoints, sharing data reduces the success probability (in this case corresponding to the type 1 error) to 5%, while the success probability is 8% when not sharing data. In general, we see a difference in success probabilities when only one of the two endpoints is reached: the success probability is higher if the treatment is efficacious on the fibrosis endpoint. This is intended and consistent with our assumption that NASH resolution leads to fibrosis improvement. Higher correlations between the two endpoints lead to reduced success probabilities; this is explained by the "OR" decision rule and the fact that outcomes are simulated using a latent bivariate normal distribution. This effect is most pronounced when the treatment is moderately efficacious for one or both of the endpoints (i.e. a responder rate of 45%).
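The quantities underlying these decisions are posterior probabilities of the form P(π_E > π_S + δ | Data). Under independent Beta(1, 1) priors for each arm's response rate, these can be estimated by Monte Carlo. The sketch below (function names and the 0.95 threshold are illustrative, not the trial's exact calibration) also shows the "OR" graduation rule across the two endpoints:

```python
import numpy as np

rng = np.random.default_rng(2)

def prob_superior(x_e, n_e, x_s, n_s, margin=0.0, n_draws=100_000):
    """Estimate P(pi_E > pi_S + margin | data) under independent
    Beta(1, 1) priors, i.e. Beta(1 + x, 1 + n - x) posteriors per arm."""
    pi_e = rng.beta(1 + x_e, 1 + n_e - x_e, n_draws)
    pi_s = rng.beta(1 + x_s, 1 + n_s - x_s, n_draws)
    return float(np.mean(pi_e > pi_s + margin))

def graduate(endpoint1, endpoint2, threshold=0.95):
    """'OR' rule: graduate if either endpoint clears the evidence bar.
    Each endpoint is a tuple (x_e, n_e, x_s, n_s) of responder counts."""
    return prob_superior(*endpoint1) >= threshold or prob_superior(*endpoint2) >= threshold
```

With 60/100 versus 30/100 responders on one endpoint, `prob_superior` is essentially 1 and the treatment graduates regardless of the second endpoint, illustrating why a single strongly positive endpoint suffices under this rule.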
Average platform trial durations (i.e. the time until a decision is made for the last investigational treatment) are shown in Figure 5. It becomes apparent that when the treatment is weakly efficacious for either or both of the endpoints (i.e. a 35% responder rate), the trial duration is longest, since it is unlikely that the treatment will be stopped for efficacy or futility at any of the interim analyses. For similar reasons, the trial duration is shorter when the treatment is efficacious for neither of the endpoints and shortest when the treatment is highly efficacious for both of the endpoints. Generally, sharing concurrent data can lead to savings in trial duration of at most 2 weeks (sample size per arm 75; bottom left panel in Figure 5; 163 vs 165 weeks) or 6 weeks (sample size per arm 150; top left panel in Figure 5; 212 vs 218 weeks) compared to not sharing data (please note these numbers are also influenced by our assumptions that new treatments enter the platform every 24 weeks and that the platform necessarily runs until five treatments are evaluated). Please note that since there is a 52-week lag in observing the final endpoints, even if a decision is made early, this might not translate into savings in trial participants. In case the sample size per arm is 75, we observed no savings in terms of trial participants enrolled (i.e. 750 trial participants are always enrolled over the course of the trial). This is due to the assumed recruitment rate, which would lead to full recruitment before the first interim analysis (a slower recruitment rate or more treatment arms investigated simultaneously might lead to savings). When the sample size per arm is 150, analogously to the average trial duration, we observed savings in trial participants when treatments are overwhelmingly efficacious or futile (in the most extreme case of overwhelming efficacy on both endpoints and sharing data, approximately 382 out of 1500 trial participants were saved, compared to if no early stopping rules had been in place).
Probabilities of stopping early with respect to treatment efficacy are shown in Figure 6. In case the treatment is not better than placebo, the futility rules eliminate approximately 60% of treatments at the first interim analysis and approximately 80% by the second interim analysis. This holds when the sample size per treatment arm is 75; the probabilities increase further with the larger sample size of 125. When the treatment is highly efficacious on both endpoints, the efficacy decision rules graduate most of the treatments at the first or second interim analysis. We also see that the futility interim decision rules eliminate more treatments that are only weakly effective on endpoint 1 and ineffective on endpoint 2 than if the reverse were true (this is a result of identical futility stopping rules while the standard-of-care response rate is lower for endpoint 1 than for endpoint 2).

Impact of Bayesian Multi-Level Decision Rules
Trial success probabilities with respect to the chosen simulation parameters as well as the level of evidence required in the Bayesian decision rules (see section 2.2) are shown in Figure 7. It becomes apparent that when requiring only the lowest level of evidence (evidence level 1), the type 1 error is controlled and for all investigated effect sizes there is a large success probability (which, depending on the target product profile, might or might not be desired). When requiring a second level of evidence, success probabilities for effect sizes identified as insufficiently promising in our decision rule drop substantially, while success probabilities for large effect sizes remain large. When requiring all three levels of evidence, only treatments with a very large effect size are advanced with a high probability.

Discussion
For this exploratory phase 2b platform trial in NASH, a Bayesian framework has been chosen to incorporate the information from two endpoints (resolution of NASH without worsening of fibrosis (endpoint 1) and/or 1-stage fibrosis improvement without worsening of NASH (endpoint 2)) in the Bayesian decision rule. Based on the regulatory requirements, it has been decided that for the success criteria it is sufficient to demonstrate efficacy in either of the two endpoints. However, based on emerging phase 2b and phase 3 data for compounds under development, it is not sufficient to simply show superiority; instead, there should be sufficient evidence that the effect sizes are large enough to show differentiation in order to graduate a treatment from phase 2b to phase 3. Therefore, the Bayesian decision rules have been extended to allow for different levels of evidence. Firstly, high confidence is needed that the experimental treatment is better than control by any margin. Secondly, high confidence is required that the true effect is at least moderate. Finally, some evidence is required that the true effect is relatively large and competitive with respect to the current landscape of compounds in the development pipeline.
In frequentist trials, a study is powered to show superiority. The assessment of whether the observed effect is relevant would be deferred to the lower bound of the confidence interval of the observed effect. Such a strategy could also be translated into a shifted hypothesis test using the duality between confidence intervals and frequentist tests. The proposed Bayesian framework allows a convenient way of combining the evidence from both endpoints and extends naturally to futility stopping rules at interim analyses. In this simulation study, we proposed that in order to declare success it is sufficient to demonstrate efficacy in either of the two endpoints, whereas both endpoints have to show insufficient efficacy to declare futility. In a frequentist trial, further multiplicity adjustments would have been needed for both the repeated significance testing and the testing of two endpoints. We also investigated different ways in which two correlated binary endpoints can be simulated. The assumed correlation is critical for the trial's probability of success. Due to the use of an "OR" criterion (i.e. the treatment is graduated if either of the endpoints is met), a higher correlation will lead to decreased success probabilities. Therefore, for planning purposes and for determining sample sizes, it might be preferable to assume larger correlations, but ultimately this also depends on the chosen method for generating correlated binary endpoints.
A draft of the proposed platform trial protocol including some simulation results was discussed with the FDA in a Critical Path Innovation Meeting (CPIM) in January 2022 [36]. There were no objections to using Bayesian decision rules, and it was acknowledged that in this phase 2b setting there is no need to correct for multiplicity on the platform level resulting from testing several compounds in the same trial. Regarding the sharing of data, the FDA supported the idea of using concurrent control data (defined as data of trial participants who were randomized to the control arm in another cohort while randomization was ongoing in the cohort of interest and who meet the inclusion/exclusion criteria for all cohorts), but was opposed to using non-concurrent control data. Similar responses were received when discussing this NASH platform trial design with the EMA in an Innovation Task Force (ITF) meeting in November 2022. Within EU-PEARL, these results currently serve as a foundation for discussing how to choose the sample size and decision rules for the platform trial protocol, as there seems to be some latitude with respect to the type 1 error, especially considering that this is planned as a phase 2b design for decision making and not for registration purposes.
One of the main advantages of platform trials is that they reduce the time and number of trial participants required to make a decision [2]. This is usually achieved by both operational and statistical efficiencies, such as multiple interim analyses and sharing data concurrently/non-concurrently across investigational cohorts. The impact of the interim analyses in reducing the duration of the study and/or the number of participants will be lessened when the recruitment rate is fast relative to the time needed to observe the final endpoints (for example, if all of the participants needed for analysis in an investigational cohort are recruited in 3-6 months and the interim and final analyses are conducted at 12 months, there will be no savings in the number of participants entering the platform trial). This can lead to a large number of participants who have been randomized into a cohort but have not yet had their primary endpoint observed when a decision is made to stop the cohort for either superior or futile efficacy [37]. This problem becomes most evident in the trial design investigated here when the sample size per treatment arm is 75 with an accrual rate of 6 participants/week. Even if a decision is made at one of the interim analyses to stop the cohort for superior efficacy, there will potentially be regulatory interest in seeing the treatment effect of the full cohort to further demonstrate the robustness of the efficacy observed at the interim analysis.
If futility were demonstrated, participants who have not yet reached week 52 could ethically be spared an additional invasive liver biopsy. Therefore, initiatives to establish validated short-term endpoints in NASH based on biomarkers are critical. Savings in trial participants might be more pronounced when the recruitment rate is slower, more treatments are evaluated at the same time and/or the sample size is larger. When the effect size is large, we observed time savings of around six weeks and up to a 20-25% reduction in the number of required participants when assuming 125 participants per treatment group, overwhelmingly superior or futile efficacy and the use of a concurrent control, indicating that even under the simple simulation assumptions studied, savings can be achieved under ideal conditions. Further work is needed to determine how a phase 2b NASH platform trial can achieve greater efficiency in the number of participants that need to be evaluated while making the best decisions in advancing those investigational treatments that have the potential of demonstrating transformative efficacy.
Many extensions of this simulation study can be considered. First, more simulation parameters could be investigated; especially with respect to a range of accrual rates, we would expect the savings in trial duration and participants to change meaningfully. The time between new treatments entering was set to 24 weeks and the maximum number of cohorts to 5, with two cohorts starting initially. If these assumptions were changed such that more or fewer cohorts would be enrolling concurrently, differences in trial duration and success probabilities might be more pronounced for a concurrent control versus using cohort control groups only. Also, the use of non-concurrent controls could be investigated. We observed no significant changes in success probabilities when the sample size was increased beyond 75 trial participants per arm; therefore, we believe no larger sample sizes are warranted based on the simulation parameters evaluated to date. So far, it is assumed that trial participants are randomized equally between open cohorts and, within cohorts, between treatment arms. The use of response-adaptive randomization might allow effective treatments to graduate faster; we did not investigate its use further, because we assumed identical treatment effects for all treatments within one platform trial. It was assumed that it is enough to show efficacy on one of the endpoints (i.e. an "OR" decision rule). Future research could aim to show efficacy on both endpoints (i.e. an "AND" decision rule). Both co-primary endpoints proposed for this phase 2b platform trial have been discussed and agreed on by regulatory agencies, i.e. at an ITF meeting with the EMA in November 2022 and a CPIM meeting with the FDA in January 2022. While the resolution of steatosis or fibrosis improvement are clearly important endpoints, it may be scientifically interesting to evaluate disease progression to NASH cirrhosis. This might reveal that treatment regimens are unable to improve the baseline condition but are possibly able to prevent disease progression. The only progression endpoint that is of regulatory interest is the one showing a progression from F2/F3 to F4. However, this is usually part of the clinical outcome composite endpoint evaluated in most NASH phase 3 trials, and it would require a longer follow-up than the 12-18 months used in most phase 2b NASH clinical trials. Simulations of phase 3 NASH designs should therefore incorporate the cumulative incidence of important clinical outcomes [38,39] as the baseline risk and include clinically relevant outcomes such as the prevention of cirrhosis. However, the design of a phase 3 platform trial goes beyond the scope of this paper; the proposed platform trial is designed to incorporate phase 2b trials, but not phase 3 trials.
To conclude, we have found, based on our assumptions, that a NASH phase 2b platform study design employing Bayesian decision rules can demonstrate some time efficiencies when the effect sizes are large for either primary (histologic) endpoint, which is consistent with accelerating the development of transformational therapies. These time efficiencies would be in addition to those offered by a platform study in terms of accelerating start-up activities, thereby creating a favorable proposition for drug developers, especially small biotechs. It is possible that once short-term (i.e., 12-24 week) biomarkers (alone or in combination) have sufficient data to predict efficacy for histological and/or clinical outcomes, the use of a platform design may become even more powerful using a phase 2a/2b seamless design, which could offer not just time savings but a reduced sample size as well. However, for the moment, based on current knowledge in the NASH field, the proposed design offers the benefits of potentially creating more opportunities for participants and an overall reduced trial conduct time for developers, leading to the main goal of EU-PEARL: providing tools for accelerating drug development with increased efficiency in a cross-sponsor approach that will ultimately benefit patients.
Figure 4: Success probabilities for the treatment arm with respect to the response rate for endpoint 1 (E1 Suc Rate; columns), the response rate for endpoint 2 (E2 Suc Rate; rows), the correlation between the two endpoints (x-axis), the type of data sharing used (point shape) and the planned cohort sample size per arm (colour). The blue horizontal line marks 80% as a common target for the power and the red horizontal line marks 10% as a common target for the type 1 error in early phase clinical trials. When the drug is truly effective, success probabilities correspond to power; when the drug is not effective, success probabilities correspond to the type 1 error.
Figure 5: Average platform trial duration in weeks with respect to the response rate for endpoint 1 (E1 Suc Rate; columns), the response rate for endpoint 2 (E2 Suc Rate; rows), the correlation between the two endpoints (x-axis), the type of data sharing used (point shape) and the planned cohort sample size per arm (colour).
Figure 6: Cumulative probabilities to make a decision early (i.e. making an early efficacy or futility decision either at the first or second interim analysis) with respect to the response rate for endpoint 1 (E1 Suc Rate; columns), the response rate for endpoint 2 (E2 Suc Rate; rows), the correlation between the two endpoints (x-axis), the type of data sharing used (point shape) and the planned sample size per cohort (left panel 150 and right panel 250).
Figure 7: Success probabilities for the treatment arm with respect to the response rate for endpoint 1 (E1 Suc Rate; columns), the response rate for endpoint 2 (E2 Suc Rate; rows), the correlation between the two endpoints (x-axis), the type of data sharing used (point shape), the level of evidence required (colour) and the planned sample size per cohort (left panel 150 and right panel 250). The blue horizontal line marks 80% as a common target for the power and the red horizontal line marks 10% as a common target for the type 1 error in early phase clinical trials. The level of evidence required refers to how many of the Bayesian efficacy rules specified in section 2.2 need to simultaneously hold for a treatment to be declared efficacious.
Final analysis: Declare efficacy if either of the following two conditions is true:

1. Posterior probability of at least 95% that the success rate of NASH resolution in the treatment arm is larger than in the SOC arm AND …

STOP, if (P(π_E > π_S + δ^{F,T}_{k,1} | Data) < γ^{F,T}_{k,1}) ∨ (P(π_E > π_S + δ^{F,T}_{k,2} | Data) < γ^{F,T}_{k,2}) ∨ ⋯ ∨ (P(π_E > π_S + δ^{F,T}_{k,m} | Data) < γ^{F,T}_{k,m}),

whereby π_S denotes the response rate in the standard-of-care arm, π_E denotes the response rate in the experimental treatment arm, T ∈ {1, 2, ..., N} denotes the analysis time point, subscript k denotes the endpoint (k ∈ {1, ..., K}) and subscripts l and m denote the possibility to have multiple decision rules at any given point in time. At an interim analysis (T ∈ {1, 2, ..., N − 1}), if neither a decision for early efficacy nor for futility is made, the cohort continues unchanged. At the final analysis (T = N), if the efficacy boundaries are not met, the cohort automatically stops for futility. The initial letter G or F in the superscript of the thresholds δ and γ indicates whether the boundary is used to stop for efficacy (G) or futility (F). Choosing, for example, γ^{G,1}_{k,u} = 1 ∀u ∈ {1, ..., l} corresponds to not allowing early stopping for efficacy for endpoint k at interim 1. If at any point in time both stopping for early efficacy and for futility is allowed, the parameters need to be chosen carefully such that GO and STOP decisions are not simultaneously possible. Please note that the requirements in equation 3 refer to the evidence needed to declare efficacy or futility of a single endpoint, i.e. while we might have multiple efficacy requirements which need to be simultaneously fulfilled to declare a single endpoint efficacious, we advance the treatment if it is found to be efficacious in either of the two endpoints (and analogously for futility decisions).
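As an illustration, a multi-level efficacy check of this kind can be evaluated on posterior draws: every (margin δ, confidence γ) pair must hold simultaneously for an endpoint to count as efficacious. The sketch below assumes Beta(1, 1) priors; the specific (δ, γ) pairs mirror the three evidence levels in Table 2 but the function names and example data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def posterior_draws(x, n, size=100_000):
    # Beta(1, 1) prior -> Beta(1 + x, 1 + n - x) posterior for a response rate
    return rng.beta(1 + x, 1 + n - x, size)

def endpoint_efficacious(draws_e, draws_s, rules):
    """All (margin delta, confidence gamma) requirements must hold
    simultaneously for this endpoint to count as efficacious."""
    return all(np.mean(draws_e > draws_s + d) >= g for d, g in rules)

# Three-level efficacy rule: delta = 0/0.30/0.40 with gamma = 0.95/0.85/0.60
rules = [(0.0, 0.95), (0.30, 0.85), (0.40, 0.60)]
```

A treatment with an observed response rate of 80% versus 20% on the control clears all three levels, while a treatment with 40% versus roughly 29% already fails the first level, so the multi-level structure filters out modest effects without extra machinery.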
In order to achieve the multi-level decision rules described in Table 2 and generalized in equation 3, we set the following parameters for efficacy (l = 3) and futility (m = 1): δ^{G,T}_{k,1} = 0, γ^{G,T}_{k,1} = 0.95, γ^{G,T}_{k,2} = 0.85, γ^{G,T}_{k,3} = 0.60, ∀T ∈ {1, 2, 3}, ∀k ∈ {1, 2}.

Figure 8: Beta distributions corresponding to the posterior we would observe if a Beta(1,1) prior and a sample size of 75 were used and the observed response rate equalled 0.5. Panel a: The posterior probability for a success rate greater than or equal to 0.5 is 50%. If in our Bayesian decision rules we set a target of 0.5 and require a confidence of 50%, for large sample sizes we will graduate compounds with a true responder rate of 0.5 in 50% of the cases, i.e. achieve a power of 50%. Panel b: The posterior probability for a success rate greater than or equal to 0.45 is 81%. If we set a target of 0.45 and require a confidence of 81%, for large sample sizes we will graduate compounds with a true responder rate of 0.5 in 50% of the cases, i.e. achieve a power of 50%. Panel c: The posterior probability for a success rate greater than or equal to 0.40 is 96%. If we set a target of 0.40 and require a confidence of 96%, for large sample sizes we will graduate compounds with a true responder rate of 0.5 in 50% of the cases, i.e. achieve a power of 50%. In order to achieve larger power values, the required confidences and target success rates need to be adapted.
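The probabilities quoted in Figure 8 can be reproduced directly from the Beta posterior: a Beta(1,1) prior combined with 75 observations at an observed response rate of 0.5 yields a Beta(38.5, 38.5) posterior. A short check (assuming scipy is available):

```python
from scipy.stats import beta

# Beta(1, 1) prior + 75 observations with an observed response rate of 0.5
a = b = 1 + 0.5 * 75  # Beta(38.5, 38.5) posterior

print(beta.sf(0.50, a, b))  # panel a: P(rate >= 0.50) = 0.50 by symmetry
print(beta.sf(0.45, a, b))  # panel b: P(rate >= 0.45), approximately 0.81
print(beta.sf(0.40, a, b))  # panel c: P(rate >= 0.40), approximately 0.96
```

Each (target, confidence) pair is calibrated to the same operating characteristic, which is why tightening the target while relaxing the confidence (or vice versa) leaves the asymptotic power at 50% in all three panels.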

Figure 1: Phase 2b platform trial design in non-alcoholic steatohepatitis (NASH). After an initial inclusion of two cohorts consisting of a control (usually the standard of care, "SOC") and a "regimen" arm (which could be a monotherapy or a combination therapy), more cohorts of the same structure enter the trial over time. Within each cohort, several interim analyses and a final analysis are conducted using the co-primary binary endpoints "NASH resolution without worsening of fibrosis" and "Fibrosis improvement without worsening of NASH". The platform trial ends when all cohorts have been evaluated.

Level 1: We need high confidence (e.g., γ_1 = 0.95) that treatment is better than control by any margin (δ_1 = 0)
Level 2: We need high confidence (e.g., γ_2 = 0.85) that treatment is better than control by a certain margin (here δ_2 = 30 percentage points)
Level 3: We need some evidence (e.g., γ_3 = 0.60) that large differences (here δ_3 = 40 percentage points) between treatment and control are plausible
Futility: Stop the trial early if there is very little confidence that the treatment is better by some margin than the control (here 25 percentage points)

Figure 9: sens_SL and spec_SL with respect to p^k_{1·} (columns), p^k_{·1} (rows) and ρ (shade of color). For scenarios with p^k_{1·} < p^k_{·1}, there is an upper bound u_spec < 1 for spec_SL. For scenarios with p^k_{1·} > p^k_{·1}, there is an upper bound u_sens < 1 for sens_SL. For scenarios with p^k_{1·} = p^k_{·1}, either sens_SL (p^k_{1·} = p^k_{·1} ≤ 0.5) or spec_SL (p^k_{1·} = p^k_{·1} ≥ 0.5) can take any value between 0 and 1. If the matrix of figures was transposed, we would see sens_LS and spec_LS instead of sens_SL and spec_SL. Please note that in the figure the label "rho" is used for ρ.

Figure 10: Relationship between the specified correlation ρ_k of the two continuous endpoints prior to dichotomization and the achieved correlation φ_k of the two binary endpoints after dichotomization. Only when p^k_{1·} = p^k_{·1} = 0.5 can φ_k ∈ [0, 1] be achieved; otherwise it is bounded above and/or below (see paragraph "Correlation" in section "Implicit specification" for more details on the bounds). The more p^k_{1·} and p^k_{·1} differ, the closer either the upper or lower bound of φ_k is to 0. Please note that in the figure the labels "phi" and "rho" are used for φ and ρ.
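The bounds referred to in the caption follow from the Fréchet-Hoeffding bounds on the joint response probability: given the marginals, P(both respond) can range only between max(0, p1 + p2 − 1) and min(p1, p2), which translates directly into an attainable range for the binary correlation φ_k. A sketch with illustrative naming:

```python
import math

def phi_bounds(p1, p2):
    """Attainable range of the correlation between two binary endpoints
    with marginal response probabilities p1 and p2, obtained by plugging
    the Frechet-Hoeffding bounds on P(both respond) into
    phi = (p11 - p1*p2) / sqrt(p1*(1-p1)*p2*(1-p2))."""
    denom = math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    lo = (max(0.0, p1 + p2 - 1) - p1 * p2) / denom
    hi = (min(p1, p2) - p1 * p2) / denom
    return lo, hi
```

For equal marginals of 0.5 the full range (−1, 1) is attainable, while for the example rates of 30% and 40% the range is roughly (−0.53, 0.80), illustrating why φ_k is bounded away from ±1 whenever the marginals differ.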

Table 2: Different levels of evidence required to graduate a treatment for efficacy. The ordering is hierarchical in nature, i.e. requiring two levels of evidence means that level 1 and level 2 need to be simultaneously fulfilled. E1 and E2 refer to endpoint 1 (resolution of NASH without worsening of fibrosis) and endpoint 2 (1-stage fibrosis improvement without worsening of NASH), respectively.

Table 3: Specification of important simulation parameters. Values are either fixed or varied across simulation scenarios. For the different simulation parameters, we differentiate between parameters that are considered a design choice ("D") and parameters that are considered an assumption ("A") regarding the future course of the platform trial or the treatment effects (see second column, "Type").

• Posterior probability of at least 85% that the success rate of NASH resolution in the treatment arm is by at least 30 percentage points larger than in the SOC arm AND
• Posterior probability of at least 85% that the success rate of fibrosis improvement in the treatment arm is by at least 17.5 percentage points larger than in the SOC arm AND
• Posterior probability of at least 60% that the success rate of fibrosis improvement in the treatment arm is by at least 25 percentage points larger than in the SOC arm

A.1.2 Formal Definition

Based on the decision rules given in equation 2 in section 2.2, the proposed multi-component decision rules for several endpoints and interim analyses can be generalized as follows:

GO, if (P(π_E > π_S + δ^{G,T}_{k,1} | Data) > γ^{G,T}_{k,1}) ∧ (P(π_E > π_S + δ^{G,T}_{k,2} | Data) > γ^{G,T}_{k,2}) ∧ ⋯ ∧ (P(π_E > π_S + δ^{G,T}_{k,l} | Data) > γ^{G,T}_{k,l})