OSCAR: Optimal subset cardinality regression using the L0-pseudonorm with applications to prognostic modelling of prostate cancer

In many real-world applications, such as those based on electronic health records, prognostic prediction of patient survival is based on heterogeneous sets of clinical laboratory measurements. To address the trade-off between the predictive accuracy of a prognostic model and the costs related to its clinical implementation, we propose an optimized L0-pseudonorm approach to learn sparse solutions in multivariable regression. The model sparsity is maintained by restricting the number of nonzero coefficients in the model with a cardinality constraint, which makes the optimization problem NP-hard. In addition, we generalize the cardinality constraint for grouped feature selection, which makes it possible to identify key sets of predictors that may be measured together in a kit in clinical practice. We demonstrate the operation of our cardinality constraint-based feature subset selection method, named OSCAR, in the context of prognostic prediction of prostate cancer patients, where it enables one to determine the key explanatory predictors at different levels of model sparsity. We further explore how the model sparsity affects the model accuracy and implementation cost. Lastly, we demonstrate generalization of the presented methodology to high-dimensional transcriptomics data.


Restricting the number of kits in OSCAR
In many applications, some predictors are always measured together, meaning that they belong to the same kit. This is typical in medicine, for example, since many medical tests yield values for more than one predictor. Thus, instead of restricting the number of nonzero coefficients, it may sometimes be more useful to limit the number of kits used and allow only the coefficients in the selected kits to be nonzero. In this subsection, we generalize the cardinality constraint to take the kit structure into account.
We assume that the model has altogether p single predictors and k ≤ p different kits, and that each predictor belongs to exactly one kit. To incorporate the kit structure, we define a kit matrix $W \in \mathbb{R}^{k \times p}$ with elements
\[
(W)_{i,j} = \begin{cases} 1, & \text{if predictor } j \text{ belongs to kit } i, \\ 0, & \text{otherwise.} \end{cases}
\]
Thus, the kit matrix $W$ is a (0, 1)-matrix and its row $i$ describes the predictors in kit $i$. In addition, $\sum_{i=1}^{k} (W)_{i,j} = 1$ holds for each predictor $j \in \{1, \ldots, p\}$, since a predictor is included in exactly one kit.

Remark 1. Sometimes a predictor may belong to more than one kit (i.e. $\sum_{i=1}^{k} (W)_{i,j} > 1$ for some $j$), which violates the assumption that a predictor is included in only one kit. However, the problem can always be modified in such a way that this assumption holds. In the modification, we first calculate for each predictor $j$ the number of kits $k_j$ to which it belongs. If $k_j > 1$, we split the original coefficient $\beta_j$ linked to predictor $j$ in the model (e.g. the scaled log partial likelihood) into $k_j$ parts by writing $\beta_j = \beta_{j,1} + \ldots + \beta_{j,k_j}$. Due to this, we have altogether $\tilde{p} = \sum_{i=1}^{k} \sum_{j=1}^{p} (W)_{i,j}$ predictors in the final optimization problem. In addition, each kit containing the coefficient $\beta_j$ of the original predictor $j$ is linked to exactly one of the coefficients $\beta_{j,i}$, $i = 1, \ldots, k_j$. This enables us to rewrite the original matrix $W \in \mathbb{R}^{k \times p}$ as a matrix $\widetilde{W} \in \mathbb{R}^{k \times \tilde{p}}$ in which each column has only one nonzero element.
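To make the kit matrix concrete, the following R sketch (our own illustration, not part of the oscar package) builds W from a hypothetical vector of kit assignments matching Example 1 below and checks the single-kit assumption via the column sums.

    ## Hypothetical kit assignment: predictors 1-3 in kit 1, 4-5 in kit 2, 6 in kit 3
    kit_of <- c(1, 1, 1, 2, 2, 3)
    k <- max(kit_of)                      # number of kits
    p <- length(kit_of)                   # number of predictors
    W <- matrix(0, nrow = k, ncol = p)
    W[cbind(kit_of, seq_len(p))] <- 1     # (W)_{i,j} = 1 iff predictor j belongs to kit i
    stopifnot(all(colSums(W) == 1))       # each predictor belongs to exactly one kit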
Next, we define a mapping $g : \mathbb{R}^p \times \mathbb{R}^{k \times p} \to \mathbb{R}^k$,
\[
g(\beta, W) = W \big( |\beta_1|, \ldots, |\beta_p| \big)^T,
\]
where $\beta \in \mathbb{R}^p$ and $W \in \mathbb{R}^{k \times p}$. For the used kit matrix, the result of this mapping is a vector containing, for each kit, the sum of the absolute values of the coefficients belonging to it.

Example 1. Assume that we consider a problem with six predictors (i.e. $\beta \in \mathbb{R}^6$). In addition, we have three kits and the kit matrix is of the form
\[
W = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}.
\]
Thus, for example, predictors $j = 1, 2, 3$ are available in the same kit (e.g. hematocrit, hemoglobin, and red blood cell count). Similarly, predictors $j = 4$ and $j = 5$ belong to the same kit, and predictor $j = 6$ is measurable alone. For this selection, we obtain
\[
g(\beta, W) = \big( |\beta_1| + |\beta_2| + |\beta_3|, \; |\beta_4| + |\beta_5|, \; |\beta_6| \big)^T.
\]

If we decide to restrict the number of kits to be $K \in \{1, \ldots, k\}$, then the cardinality constraint can be written as
\[
\| g(\beta, W) \|_0 \le K.
\]
Similarly to the case with single predictors (see subsection "Restricting the number of model features"), we can use the largest-$k$ norm of $g(\beta, W)$ to rewrite the cardinality constraint in the equivalent form
\[
\| g(\beta, W) \|_1 - ||| g(\beta, W) |||_{[K]} = 0.
\]
Since each predictor $j \in \{1, \ldots, p\}$ belongs to exactly one kit, we have
\[
\| g(\beta, W) \|_1 = \| \beta \|_1.
\]
Thus, for the selected number of kits $K \in \{1, \ldots, k\}$, this yields the cardinality-constrained problem (see (5) in the main paper) of the form
\[
\min_{\beta \in \mathbb{R}^p} \; l(\beta) \quad \text{s.t.} \quad \| \beta \|_1 - ||| g(\beta, W) |||_{[K]} = 0, \tag{S1}
\]
where $l$ is the scaled log partial likelihood (see (2) in the main paper). It is worth noticing that the problem for single predictors is a special case of (S1): it is obtained by selecting the kit matrix $W$ to coincide with the $p \times p$ identity matrix $I$, meaning that each kit contains only one predictor. To solve the problem (S1), we again use the penalty function approach [1,2], allowing us to rewrite (S1) as the unconstrained minimization problem
\[
\min_{\beta \in \mathbb{R}^p} \; l(\beta) + \rho \big( \| \beta \|_1 - ||| g(\beta, W) |||_{[K]} \big), \tag{S2}
\]

where $\rho > 0$ is a positive penalization parameter. The objective $f$ is still DC (a Difference of two Convex functions) and the DC representation $f = f_1 - f_2$ is easily constructed by selecting the convex functions
\[
f_1(\beta) = l(\beta) + \rho \| \beta \|_1 \quad \text{and} \quad f_2(\beta) = \rho \, ||| g(\beta, W) |||_{[K]}.
\]
Due to this, DBDC can also be utilized to solve the scaled log partial likelihood with the kit structure. An interesting feature of the kit-penalized reformulation (S2) is that it can also be seen as a modification of the $l_1$-penalty, since the only difference is the largest-$k$ norm term $-\rho \, ||| g(\beta, W) |||_{[K]}$. Note that this is the term restricting and controlling the upper bound for the number of nonzero kits in the problem. For the kit structure, OSCAR is presented in Algorithm A. Note that Step 1 generates starting points based on the kit structure. Otherwise the algorithm is very similar to the one with single predictors (see Algorithm 1 in the main article), and with the selection $W = I$ the execution is identical. In addition, when we solve the scaled log partial likelihood allowing at most $K_{\max}$ kits to be used, we obtain as a by-product a solution for each cardinality-constrained problem with a smaller number of used kits. It is also possible to execute OSCAR without fixing the number of kits beforehand by setting $K_{\max} = k$. In this case, the method provides a solution for each possible number of kits.
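Before moving to the acceleration procedure, the following R sketch illustrates the quantities used above: the mapping g, the largest-k norm, the penalty term of (S2), and its DC split. The placeholder likelihood and all variable names are our own, so this is a sketch of the structure rather than the oscar implementation.

    ## Kit matrix of Example 1
    W <- rbind(c(1, 1, 1, 0, 0, 0), c(0, 0, 0, 1, 1, 0), c(0, 0, 0, 0, 0, 1))

    ## Kit-wise sums of absolute coefficient values: g(beta, W) = W |beta|
    g <- function(beta, W) as.vector(W %*% abs(beta))

    ## Largest-K norm: sum of the K largest absolute components
    largest_K_norm <- function(v, K) sum(sort(abs(v), decreasing = TRUE)[seq_len(K)])

    ## Penalty term of (S2); it vanishes exactly when at most K kits are active
    kit_penalty <- function(beta, W, K, rho) {
      rho * (sum(abs(beta)) - largest_K_norm(g(beta, W), K))
    }

    ## DC split f = f1 - f2 with a placeholder standing in for the scaled log partial likelihood l
    l_scaled <- function(beta) 0                                           # placeholder for l in (2)
    f1 <- function(beta, rho) l_scaled(beta) + rho * sum(abs(beta))        # convex part
    f2 <- function(beta, W, K, rho) rho * largest_K_norm(g(beta, W), K)    # convex part

    beta <- c(0.8, 0, -0.3, 0, 0, 1.2)     # nonzero coefficients only in kits 1 and 3
    kit_penalty(beta, W, K = 2, rho = 1)   # 0: the two-kit constraint is satisfied
    kit_penalty(beta, W, K = 1, rho = 1)   # > 0: more than one kit is active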

Acceleration procedure for high-dimensional data
In high-dimensional data, the number of features p is larger than the number of observations n. Next, we introduce a new acceleration procedure used to speed up the calculations, in particular for high-dimensional data. It is needed because, for high-dimensional data, the previously described heuristic to generate initial points is not the best possible one. The reason for this is that the more high-dimensional the data is, the more time-consuming it is to solve the problem (S1) presented in the previous section. Thus, it may require a lot of computational time to solve the high-dimensional problem (S1) from each of the starting points during each iteration of OSCAR. Furthermore, in high-dimensional data most of the features are irrelevant and do not contain any useful information. Due to this, not all of the features or kits should be used to construct the starting points or be included in the problem (S1).
For the above-mentioned reasons, we have accelerated the calculations in high-dimensional data by introducing an acceleration procedure that utilizes a low-dimensional convex subproblem and a reduced-size version of the original problem (S1). The low-dimensional subproblem is fast to solve, and it is used to spot the most promising features or kits together with starting points linked to them. After the most promising features or kits are spotted, we generate a reduced-size version of the original problem (S1) based on them and solve it from the obtained starting points. Since we do not solve the original high-dimensional problem (S1) but its reduced-size version, we are able to significantly accelerate the computations. In addition, the user can control the size of the reduced problem and the number of starting points used. A more detailed description of the acceleration procedure is given below. Since the procedure is presented for the problem (S1) with the kit structure, the case with single predictors is a special case of it.
Assume that we are starting the execution of iteration $K + 1$ of the OSCAR method. Thus, we have already obtained a solution $\beta^*_K$ for the problem (S1) with $K$ kits (if $K = 0$, then $\beta^*_0 = 0$). In addition, we know that $K < k \le p$, where $k$ is the maximum number of kits and $p$ is the maximum number of single predictors. First, we collect into the sets $C_K$ and $F_K$ the indices of the kits $\{1, 2, \ldots, k\}$ and features $\{1, 2, \ldots, p\}$, respectively, already used in the solution vector $\beta^*_K$. The reduced solution vector is denoted by $\beta^*_{F_K} \in \mathbb{R}^{|F_K|}$ and it contains only the nonzero predictors of $\beta^*_K \in \mathbb{R}^p$.
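For clarity, the index sets and the reduced solution vector can be read off from a solution as in this short R sketch; the solution here is hypothetical and W is the Example 1 kit matrix from the earlier sketch.

    ## Hypothetical solution with K = 1 active kit (kit 1 of the Example 1 kit matrix W)
    beta_K  <- c(0.8, 0, -0.3, 0, 0, 0)
    F_K     <- which(beta_K != 0)                       # features already used in beta_K
    C_K     <- which(as.vector(W %*% abs(beta_K)) > 0)  # kits already used in beta_K
    beta_FK <- beta_K[F_K]                              # reduced solution vector in R^{|F_K|}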

Input: The values of the features $x_i$, the survival times $y_i$, the labels $\delta_i \in \{0, 1\}$, the kit matrix $W$, and the maximal number of kits $K_{\max} \in \{1, 2, \ldots, k\}$ up to which the cardinality-constrained problem is solved.
Output: For $K = 1, \ldots, K_{\max}$, the solution $\beta^*_K$ of the cardinality-constrained problem with $K$ kits.
Step 0: (Initialization) Solve the scaled log partial likelihood model without any regularization (see (3) in the main article) with DBDC or LMBM and denote the solution by $\bar{\beta}$. Set $\beta^*_0 = 0$ and $K = 1$.
Step 1: (Starting points) For the cardinality-constrained problem with $K$ kits, initialize the set of starting points $S_K = \emptyset$. For $j = 1, \ldots, k$, construct the point $\beta^j_0$ with the formula
\[
(\beta^j_0)_i = \begin{cases} \bar{\beta}_i, & \text{if predictor } i \text{ belongs to kit } j, \\ (\beta^*_{K-1})_i, & \text{otherwise,} \end{cases}
\]
and if $\| g(\beta^j_0, W) \|_0 > K - 1$, then add the point to the set $S_K$.
Step 2: (Penalty function problem) Do the following steps A-C for all $\beta^j_0 \in S_K$ to obtain the solutions $\beta^*_{K,j}$:
Step A: Select a positive initial value for the penalization parameter ρ.
Step B: Solve the problem (S2) with the DBDC or LMBM method starting from $\beta^j_0$ and denote the solution by $\hat{\beta}^j$.
Step C: If $\| g(\hat{\beta}^j, W) \|_0 \le K$, then set $\beta^*_{K,j} = \hat{\beta}^j$. Otherwise increase the value of the penalization parameter $\rho$ and go to Step B.
Step 3: (Solution) Select the best solution $\beta^*_K$ for the cardinality-constrained problem (S1) with $K$ kits using the formula
\[
\beta^*_K = \underset{\beta^*_{K,j}}{\arg\min} \; \big\{ l(\beta^*_{K,j}) : \beta^j_0 \in S_K \big\}.
\]
Update $K = K + 1$. If $K \le K_{\max}$, then go to Step 1. Otherwise go to Step 4.
Step 4: Return $\beta^*_K$ for all $K = 1, \ldots, K_{\max}$.

Algorithm A: OSCAR with the kit structure

In order to find promising starting points during iteration $K + 1$, the idea is to take the previous solution $\beta^*_K$ as a base and to test one by one how each missing kit would improve it. Note, however, that we only optimize the values of the predictors in the added kit. Therefore, the nonzero predictors of $\beta^*_K$ are not changed and their values are treated as constants (this corresponds to the vector $\beta^*_{F_K}$). Due to this, we go through each kit $h$ from the set $\{1, \ldots, k\} \setminus C_K$. After the kit $h$ is selected, we first define for it the set $F_h$ containing the predictors in the kit and the corresponding solution vector $\beta_{F_h} \in \mathbb{R}^{|F_h|}$. The value of the predictor vector $\beta_{F_h}$ is obtained by solving the low-dimensional convex subproblem
\[
\min_{\beta_{F_h} \in \mathbb{R}^{|F_h|}} \; f(\beta_{F_h}), \qquad f(\beta_{F_h}) = l\big( a_j + (x^{F_h}_j)^T \beta_{F_h}, \; j = 1, \ldots, n \big), \tag{S3}
\]

where $a_j = (x^{F_K}_j)^T \beta^*_{F_K}$ for $j = 1, \ldots, n$ are the constant values obtained based on the previous solution. In addition, both observation vectors $x^{F_K}_j$ and $x^{F_h}_j$ contain only those features from $x_j$ which are found in the sets $F_K$ and $F_h$, respectively. Thus, in the subproblem we solve the scaled log partial likelihood (see subsection "Cox's proportional hazards model"), where we fit kit $h$ among the previously obtained nonzero predictors $\beta^*_{F_K}$. One nice feature of the subproblem (S3) is that whenever we substitute the predictors $F_h$ of $\beta^*_K$ with the values in the solution vector $\beta^*_{F_h}$, we obtain a starting point with $K + 1$ kits. Another useful property is that when we select $\beta_{F_h} = 0$, the value $f(\beta_{F_h})$ coincides with the objective function value obtained with the previous solution $\beta^*_K$ with $K$ kits. Due to this, we can guarantee that a starting point, where the features $F_h$ of the solution $\beta^*_K$ are replaced with $\beta^*_{F_h}$, always gives a smaller objective function value for the problem (S1) with $K + 1$ kits than the previous solution $\beta^*_K$. Note also that whenever only single predictors are used, the convex subproblem (S3) is always one-dimensional.
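A convenient way to prototype the subproblem (S3) in R is to keep the already selected predictors fixed through an offset term and fit only the kit-h coefficients with coxph. This is only an illustrative stand-in (OSCAR itself solves (S3) with DBDC or LMBM), and the objects X, time, status, h, W, F_K and beta_FK are assumed to exist from the surrounding context.

    library(survival)

    ## a_j = (x_j^{F_K})' beta*_{F_K}: fixed contribution of the previously selected predictors
    a <- as.vector(X[, F_K, drop = FALSE] %*% beta_FK)

    ## predictors belonging to the candidate kit h
    F_h  <- which(W[h, ] == 1)
    X_Fh <- X[, F_h, drop = FALSE]

    ## fit only the kit-h coefficients; the previous fit enters as a constant offset
    fit_h   <- coxph(Surv(time, status) ~ X_Fh + offset(a))
    beta_Fh <- coef(fit_h)           # candidate values for the kit-h predictors
    obj_h   <- -fit_h$loglik[2]      # negative log partial likelihood, up to the scaling in (2)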
Therefore, when the acceleration procedure is used, Step 1 of Algorithm A is substituted with the following: "First, for each kit $h \in \{1, \ldots, k\} \setminus C_K$, solve the subproblem (S3) and constitute a starting point as previously discussed. Second, select a set of starting points $S_K$ to be used in Step 2 of Algorithm A." Before the selection of the starting points we have $k - K$ different options. However, we do not want to use all of them due to the high-dimensional data. Thus, the user can select a value for the parameter $\gamma \in (0, 1]$, which specifies the fraction of starting points used, and we select only $\gamma(k - K)$ of them. The selection is done based on the objective function values obtained for the subproblem (S3): only the best $\gamma(k - K)$ starting points in terms of the objective function values are used.
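The screening of starting points can then be sketched as follows; obj_vals is assumed to hold the subproblem (S3) objective value for each candidate kit in {1, ..., k} \ C_K, and the value of gamma here is purely illustrative.

    gamma      <- 0.25                                   # user-chosen fraction of starting points
    candidates <- setdiff(seq_len(k), C_K)               # kits not yet in the solution
    n_keep     <- max(1, ceiling(gamma * length(candidates)))
    kept_kits  <- candidates[order(obj_vals)[seq_len(n_keep)]]  # smallest (S3) objective first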
Finally, when we move on to Step 2 of Algorithm A, we first reduce the size of the original problem (S1). This reduced problem contains only those kits which were already used in the previous solution $\beta^*_K$, together with the ones constructing the starting points selected in Step 1. This means that in the reduced version of the problem (S1) we have altogether $K + \gamma(k - K)$ kits instead of the original $k$. Thus, the only difference in the reduced version of the problem (S1) is that we leave out part of the kits, together with the predictors belonging to them.
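Continuing the sketch above, the reduced version of (S1) then only involves the kits of the previous solution together with the screened candidate kits, and the predictors belonging to them:

    kits_reduced     <- union(C_K, kept_kits)                              # roughly K + gamma*(k - K) kits
    features_reduced <- which(colSums(W[kits_reduced, , drop = FALSE]) > 0)
    W_reduced <- W[kits_reduced, features_reduced, drop = FALSE]           # reduced kit matrix
    X_reduced <- X[, features_reduced, drop = FALSE]                       # reduced design matrix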

Approximated Pareto-fronts

[Figure: Approximated Pareto fronts for the TYKS data with a) OSCAR, b) LASSO, and c) APML0 (cf. Fig 3b in the main article). For each λ, the notation k_λ marks the number of predictors in the model fitted for the entire training data.]

The curatedPCaData R-package [11] (version 0.9.42) was used for harmonizing the input data, along with extracting key clinical metadata such as recurrence information. We limited the transcriptomics data here to primary tumor samples from patients with a non-metastatic disease subtype.

Model validation details
For model validation with the high-dimensional transcriptomics data, the full TCGA dataset was randomly split into 3/4 teaching data and 1/4 held-out validation data. Genes were filtered based on the criterion that over half of the samples were required to have a unique expression value after processing. An intersection of common gene names was taken after this filtering, resulting in a final dimensionality of p = 10 253 genes. For each gene, the expression values were z-score transformed to make model coefficients comparable across studies. All models were fit to the teaching dataset of TCGA (n = 303) for the common gene symbol intersection.
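The following R sketch illustrates the validation setup described above under our own interpretation of the filtering criterion; expr is assumed to be a samples-by-genes expression matrix, and the exact code used in the paper is referenced in the next paragraph.

    set.seed(1)                                # illustrative seed, not the one used in the paper
    n <- nrow(expr)
    train_idx <- sample(seq_len(n), size = round(3/4 * n))   # 3/4 teaching, 1/4 held-out validation

    ## keep genes whose expression value is unique in over half of the samples (our interpretation)
    keep <- apply(expr, 2, function(g) length(unique(g)) > n / 2)
    expr <- expr[, keep, drop = FALSE]

    ## z-score transform each gene so that model coefficients are comparable
    expr <- scale(expr)

    x_teach <- expr[train_idx, , drop = FALSE]
    x_valid <- expr[-train_idx, , drop = FALSE]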
The latest oscar R-package from GitHub was used for all transcriptomics analyses, with the LMBM solver and the acceleration parameter $\gamma = 2 \times 10^{-4}$. For LASSO-penalized models, the glmnet R-package [12] (version 4.1-4) was used. The APML0 R-package [13] (version 0.10) with default parameters and the ncvreg R-package [14] (version 3.13.0) with the SCAD penalty were used for further benchmarking. The greedy forward selection was implemented using coxph from the survival R-package [15] (version 3.3-1), and the code is provided in the oscar GitHub repository at './data-raw/greedy.R'. The code used for running the model validations is provided at './data-raw/validate.R'.
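As an example of how one of the benchmark models can be fit on the teaching data, a minimal LASSO Cox fit with glmnet could look like the sketch below; x_teach, time and status are assumed to come from the split sketched above, and tuning details (cross-validation folds, λ grid) are omitted.

    library(glmnet)
    library(survival)

    y_teach <- Surv(time[train_idx], status[train_idx])                 # survival response, teaching set
    cv_fit  <- cv.glmnet(x_teach, y_teach, family = "cox", alpha = 1)   # LASSO-penalized Cox model
    beta_lasso <- coef(cv_fit, s = "lambda.min")                        # sparse coefficients at the chosen lambda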