Boosting the Concordance Index for Survival Data – A Unified Framework To Derive and Evaluate Biomarker Combinations

The development of molecular signatures for the prediction of time-to-event outcomes is a methodologically challenging task in bioinformatics and biostatistics. Although there are numerous approaches for the derivation of marker combinations and their evaluation, the underlying methodology often suffers from the problem that different optimization criteria are mixed during the feature selection, estimation and evaluation steps. This might result in marker combinations that are suboptimal regarding the evaluation criterion of interest. To address this issue, we propose a unified framework to derive and evaluate biomarker combinations. Our approach is based on the concordance index for time-to-event data, which is a non-parametric measure to quantify the discriminatory power of a prediction rule. Specifically, we propose a gradient boosting algorithm that results in linear biomarker combinations that are optimal with respect to a smoothed version of the concordance index. We investigate the performance of our algorithm in a large-scale simulation study and in two molecular data sets for the prediction of survival in breast cancer patients. Our numerical results show that the new approach is not only methodologically sound but can also lead to a higher discriminatory power than traditional approaches for the derivation of gene signatures.


Introduction
Recent technological developments in the fields of genomics and biomedical research have led to the discovery of large numbers of gene signatures for the prediction of clinical survival outcomes.In cancer research, for example, gene expression signatures are nowadays used to predict the time to occurrence of metastases [1,2] as well as the time to progression [3] and overall patient survival [4,5].While the importance of molecular data in clinical and epidemiological research is expected to grow considerably in the next years [6,7,8], the detection of clinically useful gene signatures remains a challenging problem for bioinformaticians and biostatisticians, especially when the outcome is a survival time.After normalization and data pre-processing, the development of a new gene signature usually comprises three methodological tasks: Task 1: Select a subset of genes that is associated with the clinical outcome.
Task 2: Derive a marker signature by finding the "optimal" combination of the selected genes.
Task 3: Evaluate the prediction accuracy of the optimal combination using future or external data.
Task 1, the selection of a clinically relevant subset of genes, is often addressed by calculating scores to rank the univariate association between the survival outcome and each of the genes [9,8].In a subsequent step, the genes with the strongest associations are selected to be included in the gene signature.Task 2, the derivation of an optimal combination of the selected genes, is usually fulfilled by forming linear combinations of gene expression levels based on Cox regression.Due to multicollinearity problems and the high dimensionality of molecular data, a direct optimization of the Cox partial likelihood is often unfeasible [8].Consequently, marker combinations are often derived by combining coefficients of univariate Cox regression models [10], or by applying regularized Cox regression techniques (such as the Lasso [11,12] or ridge-penalized regression [13,14]).Task 3, the evaluation of prediction accuracy, is considered to be a challenging problem in survival analysis.This is because traditional performance measures for continuous outcomes (such as the mean squared error) are no longer applicable in the presence of censoring.In the literature, several approaches to address this problem exist (see, e.g., [15] for an overview).In this article, we focus on the concordance index for time-to-event data (C-index [16,17,18]), which has become a widely used measure of the performance of biomarkers in survival studies [19,20,21,22].Briefly spoken, the C-index can be interpreted as the probability that a patient with a small survival time is associated with a high value of a biomarker combination (and vice versa).Consequently, it measures the concordance between the rankings of the survival times and the biomarker values and therefore the ability of a biomarker to discriminate between patients with small survival times and patients with large survival times.This strategy is especially helpful if the aim is to subdivide patients into groups with good or poor prognosis (as applied in many articles in the medical literature, e.g., [10]).By definition, the C-index has the same scale as the classical area under the curve (AUC) in binary classification: While prediction rules based on random guessing yield C = 0.5, a perfectly discriminating biomarker combination leads to C = 1.Interestingly, the derivation of new gene signatures for survival outcomes via Tasks 1-3 is often addressed by combining completely different methodological approaches and estimation techniques.For example, the estimation of biomarker combinations is usually based on Cox regression and is hence carried out via the optimization of a partial likelihood criterion.On the other hand, the resulting combinations are often evaluated by using the C-index [21,22] which has its roots in the receiver operating characteristics (ROC) methodology.This methodological inconsistency is also problematic from a practical point of view, as the marker combination that optimizes the partial log likelihood criterion is not necessarily the one that optimizes the C-index.In other words, if the C-index and therefore the discriminatory power is the evaluation criterion of interest, it may be suboptimal to use a likelihood-based criterion to optimize the marker combination.This issue is, of course, not only problematic in survival analysis but also in regression and classification.A theoretical discussion on the differences between performance measures for binary classification can, e.g., be found in [23].To overcome the aforementioned inconsistencies, we propose a unified framework for survival analysis that is based on the same statistical methodology for gene selection (Task 1), derivation of an optimal biomarker combination (Task 2) and the evaluation of a new gene signature (Task 3).As will be demonstrated, all three tasks can be addressed by using the concordance index for time-to-event data as performance criterion.While the C-index has already been proposed for gene selection (Task 1) and the evaluation of prediction accuracy (Task 3) [9,15], the main contribution of this article is a new estimation technique that addresses the development of optimal combinations of genes (Task 2).To achieve this goal, we propose a method for finding gene combinations that is based on the gradient boosting framework [24].As will be shown, it is possible to use boosting to derive prediction-optimized gene combinations via direct optimization of the C-index.Because this new approach allows for using the C-index to address all three tasks, the proposed method leads to a consistent framework for the derivation of gene signatures in biomarker studies where the C-index is the main performance criterion.

Notation
We first introduce some basic notation that will be used throughout the article.Let T ∈ R + be the survival time of interest and X = (x 1 , ..., x p ) ∈ R p a vector of predictor variables.In addition to the gene expression levels, X may contain the measurements of some clinical predictor variables.Denote the conditional survival function of T given X by S(t|x) = P(T > t|X = x).The probability density and survival functions of T are denoted by f (t) and S(t), respectively.Further let T cens ∈ R + be a random censoring time and denote the observed survival time by T := min(T, T cens ).The random variable ∆ := I(T ≤ T cens ) indicates whether T is right-censored (∆ = 0) or not (∆ = 1).A prediction rule for T will be formed as a linear combination where β is an unknown vector of coefficients.We generally assume that the estimation of β is based on an i.i.d.learning sample {( In case of the Cox regression model, for example, η is related to T by the equation where Λ 0 (t) is the cumulative baseline hazard function.Because there is a one-to-one relationship between η and the expected survival time E(T |X), the linear combination η can be used as a biomarker to predict the survival of individual patients.

Concordance index
Our proposed framework to derive and evaluate biomarker combinations is based on the concordance index ("C-index") which is a general discrimination measure for the evaluation of prediction models [16,17].It can be applied to continuous, ordinal and dichotomous outcomes [25].For time-to-event outcomes, the C-index is defined as where T j 1 , T j 2 and η j 1 , η j 2 are the event times and the predicted marker values, respectively, of two observations in an i.i.d.test sample {( Tj , ∆ j , X j ), j = 1, . . ., N }.By definition, the C-index for survival data measures whether large values of η are associated with short survival times T and vice versa.Moreover, it can be shown that the C-index is equivalent to the area under the time-dependent ROC curve, which is a measure of the discriminative ability of η at each time point under consideration (see [26], p. 95 for a formal proof).
During the last decades, the C-index has gained enormous popularity in biomedical research; for example, searching for the terms "concordance index" and "c-index" in PubMed [27] resulted in 1156 articles by the time of writing this article.Generally, a value of C close to 1 indicates that the marker η is close to a perfect discriminatory power, while a marker that does not perform better than chance results in a value of 0.5.For example, the famous Gail model [28] for the prediction of breast cancer is estimated to yield a value of C = 0.67 [29].
Being a flexible discrimination measure, the C-index is especially useful for selecting and ranking genes from a pre-processed set of high-dimensional gene expression data (Task 1 described in the Introduction).In other words, Task 1 can be addressed by computing the C-index (and hence the marginal discriminatory power) for each individual gene or biomarker, where only those genes with the highest C-index are incorporated into the yet-to-derive optimal combination (Task 2).Although there exist various other ways to rank genes and select the most influential ones, the C-index has been demonstrated to be especially advantageous for this task [9].An estimator of the C-index for survival data is given by with j, k ∈ {1, . . ., n} ("Harrell's C for survival data", [30]).Generally, C surv is a consistent estimator of the C-index in situations where no censoring is present.However, because pairs of observations where the smaller observed survival time is censored are ignored in formula (4), C surv is known to show a notable upward bias in the presence of censoring.This bias tends to be correlated with the censoring rate [30,15].To overcome the censoring bias of C surv , Uno et al. [18] proposed a modified version of C surv , which is defined as where ĜL n (t) denotes the Kaplan-Meier estimator of the unconditional survival function of T cens (estimated from the learning data).In the following, we will assume that the censoring times are independent of X.Under this assumption, C Uno is a consistent and asymptotically normal estimator of C (see [18], pp.1113-1115).Consistency is ensured by applying inverse probability weighting (using the weights ∆ j /( ĜL n ( Tj ) 2 ), which account for the inverse probability that an observation in the test data is censored [31]).Numerical studies suggest that C Uno is remarkably robust against violations of the random censoring assumption [32].Apart from the estimators C surv and C Uno , there exist various other approaches to estimate the probability in (3) (see, e.g., [15] for an overview).Most of these approaches are based on the assumptions of a Cox proportional hazards model, so that they are not valid in case these assumptions are violated.Because C Uno is model-free and because the consistency of C Uno is guaranteed even in situations where censoring rates are high (in contrast to the estimator C surv ), we will base our boosting method on C Uno .

Boosting the concordance index
The core of our proposed framework to address Tasks 1 -3 is the derivation of a predictionoptimized linear combination of genes that is optimal w.r.t. to the C-index for time-to-event data.Our approach will be based on a component-wise gradient boosting algorithm [24] that uses the C-index as optimization criterion.Gradient boosting algorithms [33] are generally based on a loss function ρ(T, η) that is assumed to be differentiable with respect to the predictor η ≡ η(X).The aim is then to estimate the "optimal" prediction function by using gradient descent techniques.As the theoretical mean in ( 6) is usually unknown in practice, gradient boosting algorithms minimize the empirical risk R := n i=1 ρ(t i , η(x i )) over η instead.When considering the C-index for survival data, the aim is to determine the optimal predictor η * that maximizes the concordance measure -which is essentially the solution to Task 2. Hence a natural choice for the empirical risk function R would be the negative concordance index estimator Setting R = − C Uno (T, η), however, is unfeasible because C Uno (T, η) is not differentiable with respect to η i and can therefore not be used in a gradient boosting algorithm.To solve this problem, we follow the approach of Ma and Huang [34] and approximate the indicator function in (7) by the sigmoid function Here, σ is a tuning parameter that controls the smoothness of the approximation (details on the choice of σ will be given in the Numerical Results section).Replacing the indicator function in (7) by its smoothed version results in the smoothed empirical risk function with weights By definition, the smoothed empirical risk − C smooth (T, η) is differentiable with respect to the predictor η i .Its derivative is given by In the next step of the gradient boosting algorithm, the derivative in ( 10) is iteratively fitted to a set of base-learners.Typically, an individual base-learner (simple regression tool, e.g., a tree or a simple linear model) is specified for each marker.To ensure that the estimate of the optimal predictor η * is a linear combination of the components of X, we will apply simple linear models as base-learners (cf.[35]).In other words, each base-learner is a simple linear model with one component of X as input variable.Consequently, there are p base-learners, which will be denoted by b l , l = 1, . . ., p.Each base-learner refers to one component of X and therefore to one marker (or gene).
The component-wise gradient boosting algorithm for the optimization of the smoothed C-index is then given as follows: (1) Initialize the estimate of the marker combination η[0] with offset values.For example, set η[0] = 0, leading to β[0] l = 0 for all components l = 1, . . ., p. Choose a sufficiently large maximum number of iterations m stop and set the iteration counter m to 1.
(2) Compute the negative gradient vector by using formula (10) and evaluate it at the marker combination η[m−1] of the previous iteration: (3) Fit the negative gradient vector U [m] separately to each of the components of X via the base-learners b l (•): (4) Select the component l * that best fits the negative gradient vector according to the least squares criterion, i.e., select the base-learner b l * defined by (5) Update the marker combination η for this component: where sl is a small step length (0 < sl ≪ 1).For example, if sl = 0.1, only 10% of the fit of the base-learner is added to the current marker.This procedure shrinks the effect estimates towards zero, effectively increasing the numerical stability of the update step [23,25].
As only the base learner bl * was selected, only the effect of component Else increase m by one and go back to step (2).
By construction, the proposed boosting algorithm automatically estimates the optimal linear biomarker combination that maximizes the smoothed C-index.The principle behind the proposed algorithm is to minimize the empirical risk R = − C smooth (T, η) by using gradient descent in function space, where the function space is spanned by the base-learners b l , l = 1, ..., p.In other words, the algorithm iteratively descents the empirical risk by updating η[m] via the best fitting base-learner.Because the base-learners are simple linear models (each containing only one possible biomarker as predictor variable) and because the update in step (5) of the algorithm is additive, the final solution η[mstop] effectively becomes a linear combination of these markers.The two main tuning parameters of gradient boosting algorithms are the stopping iteration m stop and the step length sl.In the literature it has been argued that the choice of the step length is of minor importance for the performance of boosting algorithms [36].Generally, a larger step length leads to faster convergence of the algorithm.However, it also increases the risk of overshooting near the minimum of R. In the following sections we will use a fixed step-length of sl = 0.1, which is a common recommendation in the literature on gradient boosting [36,37] (and which is also the default value in the R package mboost [38]).The stopping iteration m stop is considered to be the most important tuning parameter of boosting algorithms [37].The optimal value of m stop is usually determined by using cross-validation techniques [24].Small values of m stop reduce the complexity of the resulting linear combination η[mstop] and avoid overfitting via shrinking the effect estimates.In case of boosting the C-index, however, overfitting is less problematic as the predictive performance of η is not related to the actual size of the coefficients but to the concordance of the rankings between marker values and the observed survival times.
As a result, the stopping iteration m stop in this specific case is less relevant and can be also specified by a fixed large value (e.g., m stop = 50000).Regarding the boosting algorithm for the smoothed C-index, an additional tuning parameter is given by the smoothing parameter σ.While too large values of σ will lead to a poor approximation of the indicator functions in (7), too small values of σ might overfit the data (and might therefore result in a decreased prediction accuracy).Details on how to best choose the value of σ will be given in the Numerical Results section.The boosting algorithm presented above is implemented in the add-on software package mboost of the open source statistical programming environment R [39].The specification of the new Cindex() family and a short description of how to apply the algorithm in practice are given in the Appendix.

Evaluation
After having applied the C-index to select the most influential genes (Task 1), and after having used the proposed boosting algorithm to combine the selected genes (Task 2), a final challenge is to evaluate the prediction accuracy of the resulting gene combination (Task 3).Since the C-index was used for Tasks 1 and 2, it is also a natural criterion to evaluate the derived marker combination in Task 3. As argued before, it is advantageous from both a methodological perspective as well as from a practical one to use the same criterion for estimation and evaluation of a biomarker combination.To avoid over-optimistic estimates of prediction accuracy, it is crucial to use different observations for the optimization and evaluation of the marker signature [25,40].This can be achieved either by using two completely different data sets (external evaluation) or by splitting one data-set into a learning sample {( T L i , ∆ L i , X L i ), i = 1, . . ., n} and a test sample {( Tj , ∆ j , X j ), j = 1, . . ., N }.The learning sample is used to optimize the marker combination while the "external" test sample serves only for evaluation purposes.A more elaborate strategy is subsampling (such as five-fold or ten-fold cross-validation), which is based on multiple random splits of the data.In our numerical analysis, we will use subsampling techniques in combination with stratification to divide the underlying data sets into learning and test samples (for details, see the next section).When it comes to the C-index, two additional points have to be taken into consideration: First, as the task is to obtain the most precise estimation for the discriminatory power, it is no longer necessary to use the smoothed version C smooth (which was included for numerical reasons in the boosting algorithm) for evaluation.Consequently, we propose to apply the original estimator C Uno for evaluating biomarker combinations in Task 3. Second, when applying the estimator C Uno to the observations in a test sample, a natural question is how to calculate the Kaplan-Meier estimator ĜL n (t) of the unconditional survival function of T cens .In principle, there are three possibilities for the calculation of ĜL n (t): The Kaplan-Meier estimator can be computed from either the test or from the training data, or, alternatively, from the combined data set containing all observations in the learning and test samples.Following the principle that all estimation steps should be carried out prior to Task 3, we will base computation of the Kaplan-Meier estimator on the learning data.

Simulation study
We first investigated the performance of our approach based on simulated data.The aim of our simulation study was: (i) To analyze if the proposed framework is able to select a small amount of informative markers from a much larger set of candidate variables.
(ii) To check if gradient boosting is able to derive the optimal combination η of the selected markers, and to compare its performance to competing Cox-based estimation schemes.
(iii) To investigate the effect of the smoothing parameter σ that controls the smoothness inside the sigmoid function, as well as potential effects of the sample size and the censoring rate on the performance of our approach.
The simulated survival times are generated via a log-logistic distribution for accelerated failure time (AFT) models [41].They are based on the model equation log(T ) = µ+φW , where T is the survival time, µ and φ are location and scale parameters, and W is a noise variable, following a standard logistic distribution.As a result, the density function for realizations t i can be written as with E(T ) = µ and Var(T ) = π 2 3φ 2 .The p = 1000 possible markers X 1 , ...., X 1000 were drawn from a multivariate normal distribution with pairwise correlation (ρ = 0.5).The true underlying combinations of the predictors were given by Note that only four out of 1000 possible markers have an actual effect on the survival time.Furthermore, those four markers do not only contribute to the location parameter µ but also to the scale parameter φ (cf.[42]) -a setting where standard survival analysis clearly becomes problematic.Additionally to the survival times T , we generated an independent sample of censoring times T cens and defined the observed survival time by T := min(T, T cens ) leading to independent censoring of 50% of the observations.In order to answer the above questions, we investigated the performance of our framework in all three tasks that are necessary to develop new gene signatures in practice (Tasks 1-3 described in the Introduction).To avoid over-optimism and biased results, we simulated separate data sets for the different tasks.In B = 100 simulation runs, we first simulated a data set {( Ti , ∆ i , X i ), i = 1, . . ., 1000} with 1000 observations to pre-select the most influential predictors based on the C-index (Task 1).The optimal combination η of those predictors was later estimated (Task 2) by our boosting algorithm based on smaller training samples {( T L i , ∆ L i , X L i ), i = 1, . . ., n} of size n.For the final external evaluation of the prediction accuracy (Task 3) we simulated a separate test data set {( Tj , ∆ j , X j ), j = 1, . . ., N } with N = 1000.For Task 1, we first pre-selected a subset of p * predictors from the p = 1000 available markers.We ranked the predictors based on their individual values of ĈUno and included only the p * = {5, 10, 30} best-performing markers in the boosting algorithm.The results suggest that the C-index is clearly able to identify markers that are truly related to the outcome: Although all predictors had a relatively high pairwise correlation (ρ = 0.5), the four informative markers had a selection probability of 98.5% for p * = 5 (99% for p * = 10 and 99.5% for p * = 30).To find the optimal combination η of the pre-selected markers (Task 2), we applied the proposed boosting approach on training samples with size n = 100.The resulting coefficients for p * = 5 and smoothing parameter σ = 0.1 are presented in Figure 1.The boosting algorithm seems to be able to derive the optimal combination of the pre-selected markers, as the structure displayed by the coefficients is essentially the same as the one of the underlying true combination η µ .The discriminatory power of the resulting biomarker does not depend on the absolute size of the coefficients: As the C-index is based solely on the concordance between biomarker and survival time, what matters in practice is the relative size of the coefficients.As can be seen from Figure 1, the estimated positive effect for x 1 is larger than the one for x 2 .On the other hand, the negative effect of x 4 is correctly estimated to be more pronounced than the the one of x 3 .The coefficient of the falsely selected marker is on average close to zero.In a third step, we evaluated the performance of the resulting optimized marker combinations (Task 3) on separate test samples.The resulting estimates ĈUno for different simulation settings are presented in Table 1.The highest discriminatory power (median ĈUno = 0.763, range = 0.559-0.819)can be observed for p * = 5, which is closest to the true number of informative markers.We compared the performance of our proposed algorithm to penalized Cox regression approaches such as Cox-Lasso [11,12] and Cox regression with ridge-penalization [13,14] see Figure 2. The proposed boosting approach clearly outperforms the competing estimation schemes, supporting our view that applying traditional Cox regression might be suboptimal if the discriminatory power is the performance criterion of interest.We additionally computed the optimal C-index resulting from the true marker combination η µ with known coefficients.The values of the true C-index on the test samples are on average only slightly better than the ones of boosting the concordance index (median ĈUno = 0.778 -see Table 1).
To evaluate the possible effects of different sample sizes and censoring rates we modified the mean censoring time leading to approximate censoring rates of 30% and 70% and generated training samples of size n = {50, 200, 500}.Results are included in Table 1.As expected, the C-index resulting from our framework increases as censoring rates become small (median ĈUno = 0.820, range = 0.736-0.858)and decreases in settings with a large proportion of censored observations (median ĈUno = 0.668, range = 0.421-0.776).However, the same effect can be observed for the true C-index resulting from the true marker combination η µ (30% censoring ĈUno = 0.830, 70% censoring ĈUno = 0.690).For larger training samples, the variance of the coefficient estimates decreases (see Figure 1).As a result, the discriminatory power resulting from our boosting algorithm improves (for n = 500, median ĈUno = 0.778, range = 0.614-0.818)and gets nearly as good as the true C-index ( ĈUno = 0.781).This finding further confirms the ability of our approach to find the most optimal marker combination possible -see Figure 2. Note that also the true C-index differs slightly between the different sample sizes, as the training sample enters in ĈUno via the Kaplan-Meier estimator ĜL n (t).To investigate the effect of the smoothing parameter inside the sigmoid function, we additionally applied our boosting procedure for every simulation setting with different values of σ.The resulting estimates ĈUno are presented in Table 2. Compared to the effects of the sample size or the number of pre-selected markers p * , the smoothing parameter σ only seems to have a minor effect on the performance of our algorithm.In light of these empirical results, we recommend to apply a fixed small value (e.g., σ = 0.1, which is also the default value in the Cindex() family for the mboost package [38] -see the Appendix).
For both approaches to fit penalized Cox regression (Cox lasso, Cox ridge), we applied the R add-on package penalized [43].In order to evaluate ĈUno , we used the UnoC() function implemented in the survAUC package [44].

Applications to predict the time to distant metastases
In the next step, we further analyzed the performance of our gradient boosting algorithm in two applications to estimate and evaluate the optimal combination of pre-selected biomarkers.All markers are used to predict the time to distant metastases in breast cancer patients.As in the simulation study, we compared the results of our proposed algorithm to Cox regression with lasso and ridge penalization.Additionally, we considered four competing boosting approaches for survival analysis, which do not directly optimize the C-index.estimated via gradient boosting, while the other three are parametric accelerated failure-time (AFT) models assuming a Weibull, log-normal or log-logistic distribution [45,46].For all boosting approaches (Weibull AFT boosting, loglog-AFT boosting and Cox boosting) we used the corresponding pre-implemented functions of the mboost package.To ensure comparability, we used the same linear base-learners as described above for all boosting approaches.To ensure that the combined predictor did not only work on the data it was derived from but also on "external" validation data, we carried out a subsampling procedure for both data sets to generate B = 100 different learning samples {( T L i , ∆ L i , X L i ), i = 1, . . ., n} and test samples {( Tj , ∆ j , X j ), j = 1, . . ., N }, respectively.Consequently, we randomly split the corresponding data sets to use 2/3 of the observations as learning sample in order to optimize the biomarker combination η.To ensure an equal distribution of patients with and without an observed development of distant metastases we applied stratified sampling.Correspondingly, the 1/3 of the observations not included in the learning sample were used to evaluate the resulting predictions via the C-index.As a result, for every data set and every method we computed 100 different combinations η and 100 corresponding values of ĈUno .

Breast cancer data by Desmedt et al.
Desmedt et al. [1] collected a data set of 196 node-negative breast cancer patients to validate a 76-gene expression signature developed by Wang et al. [10].The signature, which is based on Affymetrix microarrays, was developed separately for estrogen-receptor (ER) positive patients (60 genes) and ER-negative patients (16 genes).In addition to the expression levels of the 76 genes, four clinical predictor variables were considered (tumor size, estrogen receptor (ER) status, grade of the tumor and patient age).The data are publicly available on GEO (http://www.ncbi.nlm.nih.gov/geo,accession number GSE 7390).Similar to Wang et al. [10], we used the time from diagnosis to distant metastases as pri-mary outcome and considered the 76 genes together with the four clinical predictors.Observed metastasis-free survival ranged from 125 days to 3652 days, with 79.08% of the survival times being censored.The main results of our analysis are presented in Figure 3.As expected, the unified framework to estimate and evaluate the optimal marker signature based on the C-index is not only methodologically more consistent than the Cox and AFT approaches, but also leads to to marker signatures that show a higher discriminatory power on external or future data (median ĈUno = 0.736, range = 0.467-0.854).As discussed in the methodological section, it is crucial to evaluate the discriminatory power on external data: the estimated C-index on the training sample was more than 35% higher (median ĈUno = 0.986) and hence extremely over-optimistic [25,40].Considering the interpretation of the resulting coefficient estimates for the clinical predictors, it is crucial to note that boosting methods for the C-index and the AFT models yield biomarker combinations η * where larger values indicate longer predicted survival times.On the other hand, classical Cox regression models rely on the hazard; higher values are hence associated with smaller survival times.If this is taken into account, results from the different approaches were in fact very similar.Both age of the patients and size of the tumor had a negative effect on the time to recurrence for all seven approaches.The same holds true for the tumor grade poor differentiation which resulted in a negative effect compared to good differentiation and intermediate differentiation.A positive ER status, on the other hand, was associated with a larger metastasis-free survival in all approaches.Regarding the coefficients of the 76 genes, results from our approach to boost the C-index were highly correlated to the ones of the other four boosting approaches (which rely on the same base-learners) -absolute correlation coefficients computed from the 100 subsamples ranged from 0.77 to 0.99.Also coefficients resulting from the popular ridge-penalized Cox regression were highly correlated with the ones from our approach -absolute correlation coefficients ranged from 0.47 to 0.84.

Breast cancer data by van de Vijver et al.
The second data set consists of 144 lymph node positive breast cancer patients that was collected by the Netherlands Cancer Institute [2].The data set, which is publicly available as part of the R add-on package penalized [43], was used by van de Vijver et al. [2] to validate a 70-gene signature for metastasis-free survival after surgery developed by van't Veer et al. [47].In addition to the expression levels of the 70 genes, the data set contains five clinical predictor variables (tumor diameter, number of affected lymph nodes, ER status, grade of the tumor and patient age).Observed metastasis-free survival times ranged from 0.055 months to 17.660 months, with 67% of the survival times being censored.Resulting values of the C-index of the new approach and the six considered competitors are presented in Figure 3.The improvement from applying the proposed unified framework compared to boosting the Cox proportional hazard model or applying ridge-penalized Cox regression was much less pronounced than in the previous data set.However, on average, boosting the C-index still led to the best combination of markers regarding the discriminatory power (median ĈUno = 0.662, range = 0.257-0.836).Interestingly, as in the previous data set, the lasso penalized Cox regression was clearly outperformed by the ridge-penalized competitor (which has been suggested for this specific data set by van Houwelingen et al. [48]).Furthermore, the ridge- penalized approach performed at least as good as the considered boosting approaches (except the new approach to boost the C-index).As in the previous data set, we again additionally evaluated the C-index on the training sample in order to assess the resulting over-optimism.As expected, the estimated C-index on the training sample was extremely biased (median ĈUno = 0.973).The resulting coefficients for the clinical predictors were again comparable for the seven different approaches.A positive ER status was associated with a larger metastasis-free survival for all seven approaches, the same also holds true for the age of the patient.On the other hand, the size of the tumor, the number of affected lymph nodes and a poor tumor grade resulted for all approaches in a negative effect on the survival time.Regarding the coefficients of the 79 genes, the highest correlation could again be observed for the boosting algorithms: Absolute correlation coefficients obtained from the 100 subsamples ranged from 0.64 to 0.95.Correlation between coefficients resulting from our approach to boost the C-index and the ones from ridge-penalized Cox regression was slightly lower, it ranging from 0.30 to 0.82.

Discussion
In this article we have proposed a framework for the development of survival prediction rules that is based on the concordance index for time-to-event data (C-index).As the C-index is an easy-tointerpret measure of the accuracy of survival predictions (based on clinical or molecular data), it has become an important tool in medical decision making.Generally, the focus of the C-index is on measuring the "discriminatory power" of a prediction rule: It quantifies how well the rankings of the survival times and the values of a biomarker (or marker combinations) in a sample agree.In particular, the C-index is methodologically different from measures that evaluate how well a prediction rule is "calibrated" (i.e., from measures that quantify "how closely the predicted probabilities agree numerically with the actual outcomes" [49]).Specifically, prediction rules that are well calibrated do not necessarily have a high discriminatory power (and vice versa).While several authors have proposed the use of the C-index for feature selection [9] and the evaluation of molecular signatures [21,22], the main contribution of this paper is a new approach for the derivation of marker combinations that is based directly on the C-index.Consequently, when using the proposed method, it is no longer necessary to rely on traditional methods such as Cox regression -which focus on the derivation of well-calibrated prediction rules instead of well-disciminating prediction rules and may therefore be suboptimal when the optimization of the discriminatory power is of main interest.Conceptually, our approach is in direct line with recent articles by Ma and Huang [34], Wang and Chang [50] and Schmid et al. [51] who developed a set of algorithms for the optimization of discrimination measures for binary outcomes (such as the area under the curve (AUC) and the partial area under the curve and (PAUC)).Because the C-index is in fact a summary measure of a correspondingly defined AUC measure for time-to-event data [26], our optimization technique relies on similar methodological concepts, such as the application of boosting methods and the use of smoothed indicator functions.A possible future extension of our approach might be to include the task of selecting the most influential genes in the proposed boosting algorithm.While our simulation study and the breastcancer examples were based on the pre-selection of genes, the proposed boosting method could also be applied directly to high-dimensional molecular data, so that Tasks 1 and 2 are effectively combined.This can be accomplished by optimizing the stopping iteration so that only a (lowdimensional) subset of the candidate genes is incorporated in the resulting marker combination ("early stopping", cf.[37]).Further research is warranted on the issues of early stopping and automated feature selection in the case of boosting the concordance index for survival data.The results of our empirical analysis suggest that the new approach is competitive with state-ofthe-art methods for the derivation of marker combinations.As demonstrated in the Numerical Results section, the resulting marker combinations are not only easy to compute and have a meaningful interpretation but can also lead to a higher discriminatory power of the resulting gene signatures.

Appendix Gradient boosting of the C-index
The concept of boosting was first introduced in the field of machine learning [52,53].The basic idea is to boost the accuracy of a relatively weak performing classificator (termed "baselearner") to a more accurate prediction via iteratively applying the base-learner and aggregating its solutions.Generally, the concept of boosting leads to a drastically improved prediction accuracy compared to a single solution of the base-learner [54].This basic concept was later adapted to fit statistical regression models in a forward stagewise fashion [33,35].One of the main advantages of this approach is the interpretability of the final solution, which is basically the same as in any other statistical model [24].This can not be achieved with competing machine learning algorithms as Support Vector Machines [55] or Random Forests [56].Specifically, the boosting approach can be used to develop prediction rules for survival outcomes [45,46,57].Although there exist also likelihood-based approaches for boosting [58], we will focus here on gradient-based boosting [24] as it is the better fitting approach for boosting the distribution-free C-index.The most flexible implementation of gradient boosting is the mboost [38] add-on package for the Open Source programming environment R [39].The mboost package contains a large variety of different pre-implemented base-learners and loss functions, that can be combined by the user via different fitting functions.For a tutorial on the how to apply the package for practical data analysis, see [59].To apply gradient boosting to optimize linear biomarker combinations w.r.t. the C-index in the version of Uno et al. [18], it is necessary to specify the newly developed Cindex

Figure 1 :
Figure 1: Coefficient estimates for p * = 5 pre-selected markers obtained from 100 simulation runs.The marker combinations were optimized via gradient boosting based on training samples of size n = 100 (left) and n = 500 (right).Boxplots represent the empirical distribution of the resulting coefficients.Only markers X 1 to X 4 had an actual effect on the survival time.

Figure 2 :
Figure 2: Simulation results for the discriminatory power obtained via the proposed C-index boosting approach and competing Cox-based estimation schemes.The marker combinations were optimized via the different approaches based on training samples of size n = 100 (left) and n = 500 (right).Boxplots represent the empirical distribution of the resulting ĈUno on corresponding test samples.The dotted line corresponds to the discriminatory power resulting from the true combination of predictors with known coefficients.

Figure 3 :
Figure 3: Comparing the discriminatory power of biomarker combinations to predict the time to distant metastases resulting from the proposed C-index boosting approach with competing estimation schemes.The plot on the left refers to the Desmedt et al. data, whereas the plot on the right presents results from the van de Vijver et al. data.All biomarker combinations were optimized via the corresponding algorithms based on the same 100 learning samples.Boxplots represent the empirical distribution of the resulting ĈUno on corresponding test samples.The dotted line corresponds to the median C-index resulting from the new approach.

Table 1 :
Results of the simulation study.Comparison of the discriminatory power resulting from boosting the C-index with competing approaches.Numbers refer to the median value and interquartile range (in parentheses) of the final ĈUno on 100 simulation runs.The true C-index refers to the discriminatory power resulting from the true combination of predictors with known coefficients.The amount of pre-selected genes is denoted as p * , n is the size of the training samples and cens.refers to the censoring rate.
The first is classical Cox regression,

Table 2 :
Simulation results for different values of the smoothing parameter.Comparison of the discriminatory power resulting from the gradient boosting approach when applying different values of the smoothing parameter σ.Numbers refer to to the median value and interquartile range (in parentheses) of the final ĈUno on 100 simulation runs.The amount of pre-selected genes is denoted as p * , n is the size of the training samples and cens.refers to the censoring rate.We recommend to use the value σ = 0.1, which is also the default value of the new Cindex family for the R add-on package mboost.