Context Sensitive Modeling of Cancer Drug Sensitivity

Recent screening of drug sensitivity in large panels of cancer cell lines provides a valuable resource towards developing algorithms that predict drug response. Since more samples provide increased statistical power, most approaches to prediction of drug sensitivity pool multiple cancer types together without distinction. However, pan-cancer results can be misleading due to the confounding effects of tissues or cancer subtypes. On the other hand, independent analysis for each cancer-type is hampered by small sample size. To balance this trade-off, we present CHER (Contextual Heterogeneity Enabled Regression), an algorithm that builds predictive models for drug sensitivity by selecting predictive genomic features and deciding which ones should—and should not—be shared across different cancers, tissues and drugs. CHER provides significantly more accurate models of drug sensitivity than comparable elastic-net-based models. Moreover, CHER provides better insight into the underlying biological processes by finding a sparse set of shared and type-specific genomic features.


Details on Estimation of Prior
Similar to the penalty on features in regression (Eq. 4 in the main text), the penalty of the relevant subtype t for phenotype k corresponds to -log2(f_tk), where f_tk represents the frequency of subtype t being chosen as the relevant subtype for phenotype k.
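As a concrete sketch, this penalty (in bits) can be computed directly from the bootstrap frequency; the clamping constant eps below is an illustrative assumption to avoid an infinite penalty for a subtype that was never selected:

```python
import math

def subtype_penalty(freq, eps=1e-6):
    """Description-length penalty (bits) for choosing subtype t, given
    the frequency f_tk with which t was selected as the relevant
    subtype for phenotype k in previous bootstrap runs."""
    # Clamp the frequency so a never-selected subtype gets a large,
    # finite penalty instead of -log2(0) = infinity.
    return -math.log2(max(freq, eps))
```

For example, a subtype chosen in half of the bootstrap runs costs 1 bit; rarer choices cost more.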

Search Algorithm
To optimize Eq. 3 in the main text, we adapted the greedy search algorithm proposed by (1). We outline the algorithm here to illustrate our search strategy; pseudocode is also provided in this document.
We use a forward-backward strategy to select features. That is, the search starts with a forward pass, adding one feature at a time if the addition decreases the score defined by the objective function. Once the forward search finishes, a backward pass removes one feature at a time if the removal decreases the score. The search space includes all genomic features and the context-gene interaction terms.
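The forward-backward strategy above can be sketched as follows; score stands in for the objective of Eq. 3 (description length, lower is better) and is deliberately left abstract:

```python
def greedy_search(candidates, score):
    """Forward-backward feature selection.

    candidates: the full search space (genomic features plus
    context-gene interaction terms).
    score: maps a feature set to the objective value (lower is better).
    """
    selected = set()
    # Forward pass: repeatedly add the single best feature while it
    # improves the score.
    improved = True
    while improved:
        improved = False
        remaining = [f for f in candidates if f not in selected]
        if not remaining:
            break
        best = min(remaining, key=lambda f: score(selected | {f}))
        if score(selected | {best}) < score(selected):
            selected.add(best)
            improved = True
    # Backward pass: drop any feature whose removal improves the score.
    improved = True
    while improved:
        improved = False
        for f in list(selected):
            if score(selected - {f}) < score(selected):
                selected.remove(f)
                improved = True
    return selected
```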

Forward search
During forward search, we evaluate the addition of each candidate feature for each phenotype.
Specifically, we evaluate the score of adding feature x to the currently selected feature set X_c, where beta_c and beta_x correspond to the regression coefficients for X_c and x, respectively. The feature that yields the best (minimum) score is added to X_c if the resulting score improves on the previous score, score(X_c). The penalty of each feature is -log2(p) + 4 during the first iteration of CHER and -log2(P(beta_x != 0)) + 4 in subsequent iterations, where P(beta_x != 0) is the prior estimated from the previous iteration.
The constant 4 represents the coding cost of specifying the regression coefficient (2 bits) (1) plus the cost of specifying whether the feature is context-specific or not (2 bits). Moreover, at the end of forward search, if the current model contains no context-specific features, we re-evaluate each feature in the current model to see whether making it context-specific improves the score; again, this is to avoid local minima. We only make features context-specific if doing so results in a better score. If any feature is made context-specific, we resume the forward search to see whether further features can be added to the model.
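The per-feature coding cost can be sketched as follows; the first-iteration prior p and the clamp eps are illustrative parameters here, and their exact values follow the main text:

```python
import math

def feature_penalty(prior, first_iteration, p=0.01, eps=1e-6):
    """Coding cost (bits) of adding one feature: -log2 of its prior
    probability of being relevant, plus 2 bits for the regression
    coefficient and 2 bits for the context-specific flag.

    On the first CHER iteration a generic prior p is used; afterwards
    `prior` is the bootstrap frequency P(beta_x != 0) estimated on
    the previous iteration (clamped to avoid an infinite penalty)."""
    base = p if first_iteration else max(prior, eps)
    return -math.log2(base) + 4
```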

Backward search
During backward search, we evaluate the removal of feature x from the current feature set X_c by computing the score of the reduced set X_c \ {x}:

Eq. S3
If the score is better than score(X_c), we remove x from the current feature set. The backward search iterates until no removal results in a better score.

Equivalent-model testing
Sometimes the search algorithm may lead to a model that is more complicated than necessary. Specifically, two cases may arise: (1) feature x is selected as both a shared and a context-specific feature; (2) feature x is selected as a context-specific feature for both subtypes, but with the same signs of regression coefficients. Case (1) may indicate that a single coefficient models the data equally well. Similarly, Case (2) may suggest feature x has different effects on the two groups, but another possibility is that the feature in fact has the same effect on both groups; in the latter case, we can remove both context-specific features and use a single shared coefficient beta_x to model the effect.
Hence, we check throughout the greedy search whether either case occurs. If detected, we explicitly test which interpretation holds, so that we can obtain the simpler and hence less costly model.
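A minimal sketch of the test for case (2); fit_shared and fit_specific are hypothetical placeholders that return the description-length score of the collapsed (one shared coefficient) and context-specific models, respectively:

```python
def equivalent_model_test(beta_a, beta_b, fit_shared, fit_specific):
    """If the two context-specific coefficients for feature x share a
    sign, refit with a single shared coefficient and keep the simpler
    model when its score is no worse (lower is better)."""
    if beta_a * beta_b > 0:  # same sign: candidate for collapsing
        if fit_shared() <= fit_specific():
            return "shared"
    return "context-specific"
```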

Simulation
We generated a synthetic dataset to test the proposed algorithm. To generate the data, the algorithm is first applied to CCLE blood cancer samples (n=70), for which seven binary subtypes are defined. Mutation and gene expression features are used, and 100 bootstrap datasets are drawn to train the models for the activity area of each drug. The whole training and bootstrapping procedure is repeated for ten iterations, and the priors are updated after each iteration. A frequency cutoff of 0.15 is used to select features after the final iteration. One phenotype did not have any features passing the threshold, so it was dropped from further analysis.
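The bootstrap step above can be sketched generically; fit_select is a hypothetical placeholder for one model fit that returns the set of selected features:

```python
import random

def bootstrap_frequencies(rows, y, fit_select, n_boot=100, seed=0):
    """Selection frequency of each feature over bootstrap resamples;
    features passing a frequency cutoff (e.g. 0.15) are then kept."""
    rng = random.Random(seed)
    counts = {}
    n = len(rows)
    for _ in range(n_boot):
        # Resample the samples with replacement and refit the model.
        idx = [rng.randrange(n) for _ in range(n)]
        for j in fit_select([rows[i] for i in idx], [y[i] for i in idx]):
            counts[j] = counts.get(j, 0) + 1
    return {j: c / n_boot for j, c in counts.items()}
```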
Of the 23 phenotypes, 6 are set to have only shared features among all samples; that is, no subtype-specific features are used to simulate the sensitivity data for these 6 drugs. This tests whether the algorithm selects only shared features or is biased toward selecting subtype-specific features. The number of features used to synthesize phenotypes ranges from 2 to 11, excluding intercepts. After the features are selected, the coefficients are estimated from the original data using ridge regression with lambda = 0.01:

beta = argmin_beta ||y - X beta||^2 + lambda ||beta||^2    (Eq. S4)
The choice of ridge regression is to regularize the coefficients of collinear features. These coefficients are then used with the original features to simulate synthetic phenotype.
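The ridge estimate of Eq. S4 has a closed form; a minimal sketch with NumPy (variable names are illustrative):

```python
import numpy as np

def ridge_coefficients(X, y, lam=0.01):
    """Closed-form ridge estimate: beta = (X'X + lam*I)^(-1) X'y.
    The L2 penalty keeps coefficients of collinear features stable."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

A synthetic phenotype is then X @ beta plus Gaussian noise.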
Gaussian-distributed noise is added to the simulated phenotypes. To evaluate the algorithm, four metrics are calculated: split accuracy, precision, recall, and F-measure. Split accuracy is the fraction of phenotypes for which the retrieved relevant subtype matches the true one, i.e. the agreement between the relevant subtype retrieved for phenotype k and the true relevant subtype used to synthesize the phenotype.
Precision and recall are defined for each synthetic phenotype k: a feature j counts as selected if its relevant frequency f_jk reaches the frequency threshold, and as truly relevant if its synthesizing coefficient beta_j is nonzero; the selected set includes both shared and subtype-specific features. Intuitively, precision measures how many of the selected features are truly relevant, whereas recall indicates how many relevant features are recovered. There is usually a trade-off between precision and recall: a lenient algorithm can select as many features as possible to achieve high recall, but its precision would be low. In biology, high precision is often desired since it implies a low false-positive rate.
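Under these definitions, the retrieval metrics for one phenotype can be computed as follows (tau defaults to the 0.15 frequency cutoff used above):

```python
def precision_recall_f(relevant_freq, true_coefs, tau=0.15):
    """Precision, recall, and F-measure for one synthetic phenotype.
    A feature j is 'selected' if its relevant frequency f_jk >= tau,
    and 'truly relevant' if its synthesizing coefficient beta_j != 0."""
    selected = {j for j, f in relevant_freq.items() if f >= tau}
    relevant = {j for j, b in true_coefs.items() if b != 0}
    tp = len(selected & relevant)
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```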

Effects of Transfer Learning and Bootstrapping
To illustrate the effect of transfer learning, see S4 and S5 Figs. For comparison, we also fit elastic net models, selecting the parameter alpha, which controls the ratio of L1 to L2 norm penalty (alpha = 1 means no L2 norm penalty is used). The second parameter, lambda, which controls sparsity, is chosen from the smallest to the largest values that give non-empty models. One hundred bootstrapped runs of the elastic net are applied to each synthetic phenotype with the optimal parameters. All seven subtypes are treated as binary features and included in model training. For evaluation, we calculate precision and recall for the elastic net models; since the elastic net does not select subtype-specific features, we evaluate the models regardless of subtype specificity. Fig. 2 shows the comparison of the three metrics obtained from CHER and the elastic net. CHER achieves higher precision than the elastic net for most models at the first iteration (top panels), and its advantage in precision and F-measure grows after ten iterations of transfer learning (bottom panels).
However, in exchange for high precision and hence a low false-positive rate, CHER is more conservative than the elastic net, so its recall is not as good. This also reflects the elastic net's ability to select collinear features: since the features used include gene expression, highly correlated features are expected. These results suggest the elastic net selects many more correlated features than CHER to achieve high recall, and therefore suffers from low precision.
Nevertheless, CHER outperforms the elastic net in F-measure (the harmonic mean of precision and recall), suggesting overall better retrieval of the correct features.
The simulation results validate the correctness and robustness of the proposed algorithm and shed light on the effects of bootstrapping and transfer learning. They also suggest the algorithm is relatively conservative due to its sparsity constraints.