Cabozantinib versus everolimus, nivolumab, axitinib, sorafenib and best supportive care: A network meta-analysis of progression-free survival and overall survival in second line treatment of advanced renal cell carcinoma

Background The relative effect of therapies indicated for the treatment of advanced renal cell carcinoma (aRCC) after failure of first-line treatment is currently not known. The objective of the present study is to evaluate progression-free survival (PFS) and overall survival (OS) of cabozantinib compared to everolimus, nivolumab, axitinib, sorafenib, and best supportive care (BSC) in aRCC patients who progressed after previous VEGFR tyrosine-kinase inhibitor (TKI) treatment.
Methodology & findings A systematic literature search identified 5 studies for inclusion in this analysis. Assessment of the proportional hazards (PH) assumption between the survival curves for the treatment arms in the identified studies showed that the survival curves in two of the studies did not fulfil the PH assumption, making comparisons of constant hazard ratios (HRs) inappropriate. Consequently, a parametric survival network meta-analysis model was implemented, with five families of functions jointly fitted in a Bayesian framework to the PFS data, then the OS data, on all treatments. The comparison relied on data digitized from the Kaplan-Meier curves of published studies, except for cabozantinib and its comparator everolimus, for which patient-level data were available. This analysis applied a Bayesian fixed-effects network meta-analysis model to compare PFS and OS of cabozantinib versus its comparators. The log-normal fixed-effects model displayed the best fit to the data for both PFS and OS, and showed that patients on cabozantinib had a higher probability of longer PFS and OS than patients exposed to comparators. The survival advantage of cabozantinib increased over time for OS. For PFS, the survival advantage reached its maximum at the end of the first year of treatment and then decreased over time to zero.
Conclusion With all five families of distributions, cabozantinib was superior to all its comparators with a higher probability of longer PFS and OS during the analyzed 3 years, except with the Gompertz model, where nivolumab was preferred after 24 months.


Introduction
[...] STA, the HR NMA approach was used without formally testing for violation of the PH assumption, and this was criticized by NICE. No additional analysis has been published using an alternative method. NMAs based on parametric curves compare the shape and scale parameters of each distribution fitted to the survival curves, and do not assume PH between the pairwise comparators.
The objective of this study is to compare progression-free survival (PFS) and OS of cabozantinib with those of everolimus, nivolumab, axitinib, sorafenib and BSC, using the NMA method based on parametric survival curves as described by Ouwens et al. 2010 [18] and as used by Wiecek & Karcher (2016) in their analysis of cabozantinib and nivolumab OS relative gains [16]. Hence, this study provides an update of the analysis by Wiecek & Karcher (2016) of the nivolumab versus cabozantinib OS comparison, using mature OS data for cabozantinib and testing an additional distribution (log-normal). It also provides the first comparison of cabozantinib OS versus other treatments in second-line treatment of aRCC. Finally, to the authors' knowledge, this study provides the first PFS comparison of cabozantinib and its comparators using the NMA method based on parametric survival curves. These particular parametric survival distributions were chosen because they are typically required in decision analytic models of cost-effectiveness as part of health technology assessment agency submissions, such as those for NICE.

Study selection and data extraction
This systematic literature review aims to provide the evidence needed for the NMA. Search strategies were designed to identify any studies on cabozantinib and possible comparators. Based on the results of this broad systematic literature search, inclusion and exclusion criteria were then applied to select the studies that inform the NMA. The PICOS framework guiding the development of the search strategy is shown in Table 1, and further search parameter restrictions are shown in Table 2. Details of the databases searched are shown in Table 3. The search protocol is presented in full in S2 File search protocol.

Each of the records identified during the search was assessed for relevance against predefined inclusion and exclusion criteria (Table 4). Copies of potentially relevant full papers were obtained and further selection was undertaken based on full-text review. Double independent record selection was undertaken during screening of titles/abstracts as well as full texts, and discrepancies were resolved by discussion between reviewers or by a third reviewer.
To identify relevant evidence, a clear definition of the study participants, interventions, comparison groups, outcomes and study types of interest was required. In order to ensure that the included studies were sufficiently homogeneous to form part of a NMA, only prospective comparative randomised controlled trials (RCTs) were included. Retrospective studies were excluded from the review and NMA.

Trial selection for the network meta-analysis
For the NMA the following comparators were included: axitinib, everolimus, nivolumab, sorafenib, sunitinib and BSC/placebo. Studies with treatment which were not relevant for this

Statistical analysis
Patient-level data for the METEOR study were provided by the study sponsor. For the other studies in the evidence network, the method published by Guyot et al. [19] was used to estimate the number of deaths and the number of patients censored every month from the published Kaplan-Meier curves [14;20-22]. We used statistical tests and visual inspection to assess whether the proportional hazards assumption was violated, relying on tests and graphs based on the Schoenfeld residuals and the Therneau and Grambsch test. The Bayesian NMA was implemented with the following five parametric survival functions, fitted to the extracted PFS and OS data as described by Ouwens et al. 2010 [18]: log-normal, log-logistic, Weibull, Gompertz and exponential. The Bayesian approach was chosen because it facilitates estimation on pooled data. A posterior probability distribution of the pooled relative effect was obtained [18;23]. Fixed-effects models were considered for this analysis, and additional random-effects models were compared with the fixed-effects models for the purpose of checking heterogeneity and inconsistency. The fixed-effects and random-effects models are defined in S3 File algorithms for fixed effect model and S4 File algorithms for random effect model. A NMA comparing HRs was also carried out, in which we compared the logarithms of the HRs as described by Caldwell et al. [24]. This analysis was executed in the R package netmeta using a fixed-effect model.
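The comparison of log HRs mentioned above can be illustrated in its simplest form: two trials sharing one common comparator. The following is a minimal sketch under stated assumptions, not the netmeta implementation, which handles a whole network. All function names and all numerical inputs are hypothetical and chosen only for illustration.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_hr_and_se(hr, lower, upper):
    """Convert a hazard ratio and its 95% CI to (log HR, standard error)."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return math.log(hr), se


def indirect_comparison(a_vs_b, c_vs_b):
    """Indirect log HR of A vs C through the common comparator B.

    Each argument is a (log HR, SE) pair. The variances add because the
    two trials are assumed to be independent.
    """
    log_hr = a_vs_b[0] - c_vs_b[0]
    se = math.sqrt(a_vs_b[1] ** 2 + c_vs_b[1] ** 2)
    return log_hr, se


def back_transform(log_hr, se):
    """Return the HR with its 95% confidence limits."""
    return (math.exp(log_hr),
            math.exp(log_hr - Z_95 * se),
            math.exp(log_hr + Z_95 * se))


# HYPOTHETICAL inputs for illustration only (not results from any trial).
a_vs_b = log_hr_and_se(0.70, 0.55, 0.89)
c_vs_b = log_hr_and_se(0.90, 0.70, 1.16)

log_hr_ac, se_ac = indirect_comparison(a_vs_b, c_vs_b)
hr_ac, ci_low, ci_high = back_transform(log_hr_ac, se_ac)
print(f"A vs C: HR {hr_ac:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

The indirect estimate is always less precise than either direct estimate, because its variance is the sum of the two trial variances. The calculation is also only meaningful when a single constant HR is an adequate summary of each trial.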
Bayesian estimation of survival parameters. Four of the models assumed two-parameter distributions (log-normal, log-logistic, Weibull, Gompertz). One model assumed a one-parameter exponential distribution, which yields a fixed HR and hence assumes time-independent hazard ratios. Model parameters were estimated using a Markov Chain Monte Carlo (MCMC) method in WinBUGS [23]. The WinBUGS sampler was run for 50,000 iterations, with the first 25,000 iterations discarded as "burn-in". Convergence of the chains was checked using the Gelman-Rubin statistic [25]. Further details on the programming of the parametric survival curves are provided in S5 File Programming code for NMA.
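To make the convergence check concrete, the sketch below computes the classic variance-based form of the Gelman-Rubin statistic for one scalar parameter after discarding the first half of each chain. It is an illustration only: the chains are simulated normal draws standing in for sampler output, the helper names are hypothetical, and the diagnostic reported by the software may differ in detail from this textbook form.

```python
import random
import statistics


def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for one scalar parameter.

    `chains` is a list of equal-length lists of post-burn-in draws.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between_over_n = statistics.variance(chain_means)  # B / n
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


def simulate_chain(rng, centre, n_iter):
    """Stand-in for sampler output: independent normal draws around `centre`."""
    return [rng.gauss(centre, 1.0) for _ in range(n_iter)]


def discard_burn_in(chain, fraction=0.5):
    """Drop the first `fraction` of iterations (e.g. 25,000 of 50,000)."""
    return chain[int(len(chain) * fraction):]


rng = random.Random(2024)
mixed = [discard_burn_in(simulate_chain(rng, 0.0, 2000)) for _ in range(3)]
stuck = [discard_burn_in(simulate_chain(rng, c, 2000)) for c in (0.0, 3.0, 6.0)]

r_mixed = gelman_rubin(mixed)
r_stuck = gelman_rubin(stuck)
print(f"chains exploring the same region: R-hat = {r_mixed:.3f}")
print(f"chains in different regions:      R-hat = {r_stuck:.3f}")
```

Values close to 1 suggest that the chains agree with one another, whereas values well above 1 indicate that more iterations or a longer burn-in are needed.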
Once transitivity had been verified for each applied distribution (S6 File transitivity property) and the absence of heterogeneity or inconsistency in the network had been confirmed (S7 File Heterogeneity & inconsistency), Bayesian meta-analysis models were used to determine the difference in treatment effects. Goodness of model fit is specified in S8 File Goodness of fit.

Literature selection
The systematic literature search for RCTs on cabozantinib and 6 of its comparators retrieved 6,612 citations. After excluding duplicates (n = 1,033) and screening against the inclusion/exclusion criteria, 5,182 titles/abstracts were excluded. 400 citations were found eligible for screening at full-text level. Of these 400 records, 95 were systematic reviews, meta-analyses or health technology assessments (HTAs). The reference lists of these publications were checked for any further relevant studies; this process did not yield any additions. Of the 305 full-text articles, 241 publications were excluded. Two records were excluded due to language restrictions: both were in Chinese with an English abstract, and the abstracts indicated that both were systematic reviews with meta-analysis. In total, 65 publications, referring to 19 different studies, were considered for potential inclusion in the NMA, as shown in the PRISMA chart in Fig 1. Of these, 64 publications were identified through the systematic literature search. One additional paper on the METEOR study was published after the date of the literature search and was included for further analysis.
To perform a NMA, the studies must form a connected network. Network diagrams showing which of the treatments and comparator treatments are linked for each outcome were developed. In total, 19 studies were identified for potential inclusion in the NMA. Multiple publications reporting the same study were identified and grouped as associated references. The primary RCT data sources identified in the systematic literature search are summarised in Table 5. Of the identified studies, ten were excluded because they compared everolimus or sorafenib with agents outside the scope of this study: bevacizumab+sorafenib, GDC-0980, MK2206, AZD2014, apitolisib, temsirolimus, dovitinib, BNC105P+everolimus, tivozanib, lenvatinib, and lenvatinib + everolimus. These studies were excluded because they neither contained a comparison to a treatment of interest (cabozantinib, everolimus, nivolumab, axitinib, sorafenib or best supportive care), nor provided a link between comparators that would not otherwise have a common comparator. The studies excluded for this reason are: NCT01442090, NCT01239342, ZEBRA, DusrupTOR-1, ROVER, NCT02330983, TIVO-1, GOLD, INTORSECT, NCT01136733 [26-35]. A further four studies were excluded from the network meta-analysis: RECORD-3 [36], SWITCH [37], ESPN [38], and the study by Ratain et al. 2006 [39]. The main reason for exclusion was a sequential study design. Table 6 gives details of these further exclusions.
The NMA was planned on the endpoints PFS and OS. These are commonly selected as the primary and secondary efficacy endpoints in oncology trials, including trials in the aRCC population. Data were extracted by one person from the reports using a pre-defined extraction template in Excel. Data were sought and extracted for PFS and OS (mean, median, associated hazard ratios, and Kaplan-Meier data, plus all associated confidence intervals). Data availability for OS and PFS hazard ratios and KM curves was assessed in all included trials.
Intent-to-treat (ITT) and cross-over results (in those trials where cross-over was present) were identified for the OS endpoint. PFS can be measured by an independent review committee (IRC) and by investigators (INV). IRC assessment of disease progression was deemed likely to lead to the least bias, and hence it was prioritised if available. INV-assessed PFS was considered only in cases where IRC-assessed PFS was not available. In three of the included studies (RECORD-1, TARGET and AXIS), PFS was measured by the IRC in the interim analysis, and no further IRC-assessed updates were reported in subsequent publications. Further publications reported final OS results, and in some studies PFS continued to be assessed by INV. In these cases, the interim results for IRC-assessed PFS were used. In the CheckMate025 study (nivolumab versus everolimus), no IRC-assessed PFS could be identified. In the nivolumab NICE single technology appraisal, the manufacturer stated that disease assessment was not conducted independently in CheckMate025. The stated reason was that the CheckMate025 trial was designed with OS as the primary endpoint, and independent review of secondary endpoints was not deemed necessary, as per regulatory guidelines [40].
A key consideration for any NMA is whether the studies that have been identified are sufficiently homogeneous to allow reliable comparison. This similarity is assessed by comparing selected data from candidate studies; covariates that act as relative treatment effect modifiers must be similar across trials [41]. The similarity of the studies in each network was assessed (Table 7). The availability of subgroup results for the PFS and OS endpoints was also assessed. The final network utilised in the NMA is presented in Fig 2. The networks for the OS and PFS endpoints are the same.
There were differences between the included trials, as shown in Table 7. The main sources of difference were presence/absence of a cross-over trial design (RECORD-1, TARGET), the number and type of prior therapies as well as baseline prognostic scores (e.g. Memorial Sloan-Kettering Cancer Center [MSKCC] score).
Cross-over is present in the RECORD-1 and TARGET studies. Hence, cross-over has an impact on the cabozantinib vs axitinib and cabozantinib vs BSC comparisons. However, the comparison to nivolumab is not affected by the cross-over issue. In the RECORD-1 trial, the estimated OS HR for everolimus vs placebo (BSC) is 0.87 [0.65, 1.17] in the ITT population and 0.60 [0.22, 1.65] once adjusted for cross-over using the rank-preserving structural failure time (RPSFT) model published by Korhonen et al. [43]. The RPSFT model relies on the assumption of a constant effect of active treatment (in this case everolimus) in terms of relative survival time. Hence, the effect does not depend on when active treatment was initiated. Since the method requires additional censoring of patient data, the precision of the HR estimate is lower than for the ITT estimate. However, the method was shown to be preferable to simple adjustments, such as censoring of patients at the time of cross-over [43]. One other possible approach has been considered by Hollaender, using inverse probability of censoring weights and multivariate Cox models [44]. The best multivariate model (ranked according to the Akaike information criterion) gave HR = 0.47 [0.27, 0.82]. This approach relies on a strong (and untestable) assumption of no unmeasured confounders; the RPSFT model was therefore preferred, but the implications of using the Cox model adjustment are also discussed below. In the TARGET study, an analysis in which placebo-assigned patients who crossed over to sorafenib were censored at the start of cross-over was conducted in addition to the ITT analysis [21]. The adjustment methodology is simple censoring of all cross-over patients. The trials included in the network of evidence differed in the number of prior therapies allowed and in the distribution of compounds in the patient cohorts. In the METEOR study, more than one prior therapy was allowed.
Patients were included in the study if they had received at least one previous VEGFR TKI (there was no limit to the number of previous treatments). In CheckMate025, patients were eligible to participate if they had received one or two previous regimens of antiangiogenic therapy. In RECORD-1, previous therapy with sorafenib, sunitinib or both was allowed. The TARGET study included patients if they had progressed after one systemic treatment within the previous 8 months. AXIS study patients had received one prior systemic first-line regimen that was sunitinib-based, bevacizumab plus interferon-alfa-based, temsirolimus-based, or cytokine-based, which reflected the regimens with regulatory approval at the time of study design. In our study we included only the prior-sunitinib sub-population, because it was considered more comparable to the baseline study population of the METEOR study. Table 7 summarises the baseline prior therapies for each study and shows the availability of results for subgroups of patients by prior therapy. For CheckMate025, results stratified by number of prior therapies received were identified in the nivolumab NICE single technology appraisal [40]. Results were not identified by type of prior therapy. For RECORD-1, stratified estimates were available for PFS, but not OS. In the TARGET study publication, no subgroup data were identified that stratified results by number or type of prior therapies. The AXIS study reported results by type of first-line therapy. Due to the lack of consistency and availability of results across all trials in the network, it is not possible to analyse results by prior therapy. In the METEOR study, results were consistent regardless of the number of prior therapies (see Table 8). Available results for the CheckMate025, RECORD-1, and AXIS studies are reported in Table 9, Table 10 and Table 11, respectively. These tables illustrate the differences in reported information between the trials.
Evidence suggests that the number of prior therapies does not affect the relative efficacy of cabozantinib vs everolimus. For OS, nivolumab vs everolimus in CheckMate025 shows consistent results for patients with 1 and 2 prior therapies, although results were not statistically significant in the subgroup who received 2 prior treatments.
The MSKCC prognosis score was commonly used to stratify PFS and OS estimates: PFS in METEOR, RECORD-1 and AXIS, and OS in METEOR and CheckMate025. The TARGET trial did not include any patients with poor MSKCC prognosis, and no subgroup analysis was presented by MSKCC prognosis. No subgroup result was identified for initial prognosis for the AXIS study. An overview of identified HRs by prognosis is shown on  and of everolimus compared to placebo. As patients with poor prognosis were not included in the TARGET study, excluding such patients from the analysis might lead to a PFS HR of cabozantinib vs axitinib that is more favourable to cabozantinib, but the data needed to conduct such a comparison quantitatively (TARGET trial intermediate/favourable subgroups) are missing.

Fig 3. Fitted PFS based on the best fitting Bayesian fixed-effects model (log-normal) overlaid on extracted Kaplan-Meier (KM) data, with shaded areas representing 95% credible intervals.
Risk of bias was assessed with an adapted checklist for RCTs as proposed by the Centre for Reviews and Dissemination. Criteria for quality assessment included adequacy of the randomization method, allocation concealment, homogeneity of baseline characteristics between treatment groups, and blinding. The study quality assessment was conducted by two independent assessors. The quality assessment of included trials showed that demographic and baseline characteristics were balanced between the treatment arms in all included studies. None of the studies reported unexpected dropouts between study groups. All 5 studies reported an intent-to-treat (ITT) analysis and an appropriate method to account for missing data. A potential risk of bias arises from investigators, participants and outcome assessors not being blind to treatment allocation in all studies. Effective blinding can ensure that the compared groups receive a similar amount of attention, ancillary treatment and diagnostic investigations. Blinding is not always possible, however, and three of the studies were not double blinded:
• METEOR: Patients and investigators were not blinded to study treatment. A masked independent radiology committee assessed progression-free survival, overall survival, tumour response, duration of response, and changes on bone scans.
• CheckMate025: This was an open-label study with no IRC assessment of endpoints such as progression-free survival.
The lack of blinding of the INV may increase the risk that knowledge of which intervention was received, rather than the intervention itself, affects outcome measurement. The blinding of outcome assessors can be especially important for subjectively assessed outcomes.

Statistical analysis
The Schoenfeld residuals and Therneau and Grambsch tests indicate that the proportional hazards assumption holds for the METEOR, RECORD-1, and AXIS studies. However, the assumption is violated in the CheckMate025 and TARGET studies for both the PFS and OS endpoints. Table 13 shows the details of data availability by endpoint. Despite the violation of the PH assumption, we carried out the comparison of HRs. The results for OS and PFS are shown in Table 14 and Table 15, respectively.
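What a violation of the PH assumption looks like can be shown with a simpler graphical diagnostic than the tests used here: under proportional hazards, the gap between the two arms on the log(-log S(t)) scale is constant over time and equals the log HR. The sketch below is an illustration under stated assumptions and is not the Schoenfeld residual or Therneau and Grambsch procedure. The survival curves and their parameters are hypothetical.

```python
import math
from statistics import NormalDist


def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def lognormal_survival(t, mu, sigma):
    """Log-normal survival function S(t) = 1 - Phi((ln t - mu) / sigma)."""
    return 1.0 - NormalDist().cdf((math.log(t) - mu) / sigma)


def cloglog(survival):
    """Complementary log-log transform, i.e. the log cumulative hazard."""
    return math.log(-math.log(survival))


def cloglog_gap(surv_a, surv_b, times):
    """Gap between arms on the cloglog scale; constant if hazards are proportional."""
    return [cloglog(surv_a(t)) - cloglog(surv_b(t)) for t in times]


months = [1, 3, 6, 12, 24]

# HYPOTHETICAL arms. Two Weibull curves with a common shape satisfy PH.
ph_gap = cloglog_gap(lambda t: weibull_survival(t, 1.2, 10.0),
                     lambda t: weibull_survival(t, 1.2, 15.0), months)

# Two log-normal curves do not satisfy PH.
non_ph_gap = cloglog_gap(lambda t: lognormal_survival(t, 2.3, 1.0),
                         lambda t: lognormal_survival(t, 1.9, 0.7), months)

print("PH holds:    ", [round(g, 3) for g in ph_gap])
print("PH violated: ", [round(g, 3) for g in non_ph_gap])
```

In the first case every entry equals the log HR, so a single HR summarises the comparison. In the second the gap drifts and even changes sign, which is the situation in which a constant HR becomes misleading.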
We undertook a network meta-analysis comparing the relative efficacy of cabozantinib and its comparators. The fixed-effects model was preferred to the random-effects model for this analysis, based on the preliminary evaluation of heterogeneity. The fixed-effects models fitted the data as well as the random-effects models (Table 16 and Table 17), but were more robust to the choice of prior distributions. Moreover, the "burn-in" period was shorter than with the random-effects models.
Model fit statistics indicate that the log-normal model provided the best statistical fit for both OS and PFS (see Tables 14 and 15, respectively). While the log-normal model provided the best overall statistical fit for the whole network, the best statistical fit for each individual study varied. The fitted PFS and OS curves were superimposed on the extracted Kaplan-Meier data (Figs 3 and 4) to compare the extracted data visually with the modelled data. Visually, the log-normal model provided a good fit for PFS and OS data across all treatments, with the exception of PFS for sorafenib and axitinib. For the best fitting model (log-normal), PFS and OS for patients on cabozantinib were predicted to be superior to all other treatments up to 36 months (Figs 5 and 6). The other models (Weibull, Gompertz, log-logistic and exponential) showed similar results, except for the PFS endpoint under the Gompertz model, where nivolumab was preferred after 24 months (additional PFS and OS results are shown in S9 File Additional results for fixed effect model (full network)). The estimated hazard ratios for cabozantinib versus other treatments became more favorable to cabozantinib over time, after the first month for PFS and after the first four months for OS (see Figs 7 and 8).
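A hazard ratio that changes over time follows directly from fitting a log-normal distribution to each arm. The sketch below computes the HR between two log-normal curves at several time points. It is a minimal illustration: the (mu, sigma) pairs are hypothetical, the arms are generic rather than the treatments analyzed here, and the direction of the trend depends entirely on the parameters chosen.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def lognormal_hazard(t, mu, sigma):
    """Hazard h(t) = f(t) / S(t) of a log-normal survival time."""
    z = (math.log(t) - mu) / sigma
    density = STD_NORMAL.pdf(z) / (sigma * t)
    survival = 1.0 - STD_NORMAL.cdf(z)
    return density / survival


def hazard_ratio_curve(params_a, params_b, times):
    """HR of arm A versus arm B at each time, with params given as (mu, sigma)."""
    return [lognormal_hazard(t, *params_a) / lognormal_hazard(t, *params_b)
            for t in times]


months = [1, 6, 12, 24, 36]

# HYPOTHETICAL parameters, for illustration only.
arm_a = (2.0, 1.0)
arm_b = (1.6, 1.0)

hr_by_month = hazard_ratio_curve(arm_a, arm_b, months)
for month, hr in zip(months, hr_by_month):
    print(f"month {month:2d}: HR = {hr:.2f}")
```

Even with a common sigma, the HR is not constant, which is why comparisons under this model are reported as curves over time rather than as a single number. By contrast, two exponential arms would give the same ratio of rates at every time point.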

Discussion
Given the available data, cabozantinib can be compared with its comparators using both the NMA of constant HRs and the comparison of parametric survival curves. In our main analysis, we applied a Bayesian parametric NMA method to compare PFS and OS for cabozantinib and its comparators over time using five families of distributions. Although the PH assumption was violated in the CheckMate025 and TARGET studies, we also carried out a comparison of HRs as an alternative method. A recent NICE STA for nivolumab included a NMA comparing nivolumab to everolimus, axitinib, sorafenib and BSC [17]. This NMA was based on HRs, which we deemed an inappropriate approach due to violation of the PH assumption. Regardless of the shortcomings of the method chosen in the nivolumab NICE STA, its findings were in line with this NMA: nivolumab provided a longer OS benefit than comparators other than cabozantinib. Wiecek & Karcher (2016), on the other hand, applied a Bayesian parametric survival NMA method to compare OS for cabozantinib and nivolumab [16]. That analysis found that patients on cabozantinib exhibited a lower hazard of death than patients on nivolumab until the fifth month of treatment, whereas patients on nivolumab exhibited a lower hazard of death after that time point. However, these analyses were based on immature OS data from the METEOR trial [13], and no superiority was found when the full data were incorporated [49].
The results of the HR NMA on the OS endpoint show a trend of improvement with cabozantinib therapy versus axitinib, everolimus, nivolumab and best supportive care. No statistically significant differences between cabozantinib and its comparators were shown, with the exception of the comparison to everolimus. Cabozantinib differed significantly from its comparators with regard to PFS improvement. The parametric model that best fitted both the PFS and OS data was the log-normal distribution. In this model, the estimated PFS and OS probabilities for cabozantinib were 0.11 and 0.48 at 24 months. For its best comparator, nivolumab, the estimated survival probabilities were 0.04 and 0.43, respectively. The model that provided the worst statistical fit to the data was the exponential distribution, which assumes a constant HR over time, as in the traditional approach. This provided further justification for our choice of a model that does not require the PH assumption. The results of the HR NMA are consistent with the results of the survival curve NMA: PFS for patients receiving cabozantinib was predicted to be superior, and OS showed a trend towards improvement, compared to axitinib, everolimus, nivolumab and best supportive care.
In the Bayesian analysis, the fixed-effects model was chosen, and the random-effects model was also implemented as a sensitivity analysis and to check heterogeneity and inconsistency. Although it allows estimation of an additional between-study covariance matrix, the random-effects model returned comparison results quite similar to those of the fixed-effects model, which supported the homogeneity and consistency of the NMA at the network level. In the Bayesian framework, the fixed-effects model was favored because of its simplicity and robustness, where "robustness" means that the simulated Markov chain reached convergence quickly. However, if expert opinion were available, the random-effects model would have an advantage, because it permits the definition of a group of hyper-parameters that could reflect prior belief about study heterogeneity. In the Bayesian framework, informative prior distributions might have improved the parameter estimation and reduced the uncertainty of the estimates.
A limitation of the PFS analysis was the differing definitions of disease progression assessment (investigator versus independent). Independent review committee PFS was not available from the CheckMate025 study, and hence investigator-assessed PFS was used in this analysis instead. The results of this analysis were reported over 36 months, while the data were extracted from RCTs with shorter follow-up times. OS in real life may differ from the estimated OS, given that treatment persistence could differ from the persistence observed in the RCTs. A possible direction for future work is to implement the generalized gamma distribution, which includes various commonly used parametric survival distributions, such as the Weibull, exponential and log-normal distributions. An improved model fit over the Weibull or log-normal would then be envisaged [50]. The NMA model extends the classical meta-analysis model by comparing multiple studies with multiple arms. Provided that the network remains connected, i.e. neighboring studies share a common treatment, new studies and treatments can easily be added to the model and the approach remains feasible.
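The nesting property of the generalized gamma distribution mentioned above can be checked numerically. The sketch below uses one common parameterization (Stacy's, with scale a and shape parameters d and p), which is an assumption here since no parameterization is specified in this study. It confirms that the density reduces to the Weibull when d = p and to the exponential when d = p = 1. The log-normal arises only as a limiting case and is therefore not checked.

```python
import math


def generalized_gamma_pdf(t, scale, d, p):
    """Generalized gamma density (Stacy form), scale > 0 and shapes d, p > 0."""
    return ((p / scale ** d) * t ** (d - 1)
            * math.exp(-((t / scale) ** p)) / math.gamma(d / p))


def weibull_pdf(t, shape, scale):
    """Weibull density with the usual shape and scale parameters."""
    return ((shape / scale) * (t / scale) ** (shape - 1)
            * math.exp(-((t / scale) ** shape)))


def exponential_pdf(t, rate):
    """Exponential density with constant hazard `rate`."""
    return rate * math.exp(-rate * t)


# Arbitrary evaluation times and parameters, for illustration only.
times = [0.5, 2.0, 6.0, 12.0, 24.0]
shape, scale, rate = 1.4, 9.0, 0.08

weibull_matches = all(
    math.isclose(generalized_gamma_pdf(t, scale, shape, shape),
                 weibull_pdf(t, shape, scale))
    for t in times)

exponential_matches = all(
    math.isclose(generalized_gamma_pdf(t, 1.0 / rate, 1.0, 1.0),
                 exponential_pdf(t, rate))
    for t in times)

print("reduces to Weibull when d = p:        ", weibull_matches)
print("reduces to exponential when d = p = 1:", exponential_matches)
```

Because the simpler families are special cases, fitting the generalized gamma would allow the data to indicate which of them, if any, is adequate.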

Conclusions
Our NMA reviewed and analyzed the existing literature for RCTs examining PFS and OS of cabozantinib, everolimus, axitinib, sorafenib, nivolumab and best supportive care for aRCC in the second- and subsequent-line settings. Our review found that cabozantinib significantly improves PFS outcomes in aRCC. The results of our NMA did not show a statistically significant difference between cabozantinib and comparator therapies with regard to OS, except when compared to everolimus. However, the results of both the HR NMA and the parametric curve NMA favored cabozantinib.