The authors have declared that no competing interests exist.

Prediction of clinical drug response (CDR) of cancer patients, based on their clinical and molecular profiles obtained prior to administration of the drug, can play a significant role in individualized medicine. Machine learning models have the potential to address this issue but training them requires data from a large number of patients treated with each drug, limiting their feasibility. While large databases of drug response and molecular profiles of preclinical

Cancer is among the leading causes of death globally, and predicting the drug response of patients to different treatments from their clinical and molecular profiles can enable individualized cancer medicine. Machine learning algorithms have the potential to play a significant role in this task; however, these algorithms assume that a large number of labeled training samples are available and that these samples accurately represent the profiles of real tumors. For ethical and technical reasons, it is not possible to screen humans for many drugs, which significantly limits the size of training data. To overcome this data scarcity problem, machine learning models can be trained using large databases of preclinical samples (e.g. cancer cell line cultures). However, due to the major differences between preclinical samples and real tumors, it is unclear how accurately such preclinical-to-clinical computational models can predict the clinical drug response of cancer patients. Here, we first systematically evaluate a variety of linear and nonlinear machine learning algorithms for this task using two large databases of preclinical (GDSC) and tumor (TCGA) samples. Then, we present a novel method called TG-LASSO that explicitly incorporates the tissue of origin of samples in the prediction task. Our results show that TG-LASSO outperforms all other algorithms and can distinguish resistant and sensitive patients for the majority of the tested drugs. Follow-up analyses reveal that this method can also identify biomarkers of drug sensitivity in each cancer type.

Cancer is one of the leading causes of death globally and is expected to be the most important obstacle in increasing the life expectancy in the 21^{st} century [

Various preclinical models of cancer have been developed to enable the study of cancer and its treatment in the laboratory. CCLs, which are 2D cell cultures developed from tumor samples, are one of the least expensive and most studied of these models. Recently, several large-scale studies have cataloged the molecular profiles of thousands of CCLs and their response to hundreds of drugs [

To this end, we first formed a computational framework to systematically evaluate the prediction accuracy of different computational methods. We obtained preclinical training samples from the Genomics of Drug Sensitivity in Cancer (GDSC) database [

Next, we developed a novel approach called Tissue-Guided LASSO (TG-LASSO) to explicitly include information on the tissue of origin of samples in the regularized regression model. This method outperformed all other approaches evaluated. Using this method, we showed that the CDR of cancer patients can be predicted using preclinical CCL training samples for the majority of drugs. More specifically, TG-LASSO separated resistant patients from sensitive patients for 7 out of 12 drugs. In addition, for each tissue type and drug, TG-LASSO identified a small set of genes that may be used as tissue-specific biomarkers of drug response. We showed that genes selected by TG-LASSO for prediction of drug response are informative of patient survival when used as a gene signature, and also provide pathway-level insights into mechanisms of drug action. These results emphasize the clinical relevance of molecular profiles of preclinical samples cataloged in large-scale databases and demonstrate the importance of properly including information on the lineage of samples in follow-up analyses.

In this study, our first goal was to determine whether commonly used machine learning algorithms are capable of predicting the clinical drug response (CDR) in cancer patients using computational models trained only on cancer cell lines’ (CCLs) basal gene expression profiles (i.e. before administration of the drug) and their drug response. For this purpose, we identified 23 drugs (Supplementary

We formed a computational framework to systematically evaluate the prediction capability of different algorithms (

The input gene expression data (A) corresponding to cancer cell lines (training set) and patients’ tumors (test set) are first homogenized and their batch-effect is removed. The homogenized gene expression training data (B) and the cell lines’ log (IC50) values are used to train a regression model (C). The trained model is applied to gene expression profiles of patients’ tumors to predict their log (IC50) values, which are then used to evaluate the prediction performance (D).

We used a one-sided nonparametric Mann-Whitney U test to determine whether the estimated log(IC50) values of resistant tumors (those with CDR of ‘clinical progressive disease’ or ‘stable disease’) are significantly larger than those of sensitive tumors (those with CDR of ‘partial response’ or ‘complete response’). Note that because the type of measured drug response differs between the training set (continuous-valued log(IC50)) and the test set (categorical CDR), such an approach is necessary, and other measures of performance such as concordance index or mean squared error are not suitable. In this evaluation, we only used the 12 drugs that had at least 2 tumor samples in each category (resistant or sensitive) and at least 8 samples in total with known CDR.
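The evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `resistant_vs_sensitive_pvalue`, the toy predictions, and the use of SciPy are not taken from the original study; the CDR groupings and the inclusion rule (at least 2 tumors per group, at least 8 in total) follow the text.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def resistant_vs_sensitive_pvalue(pred_log_ic50, cdr_labels):
    """One-sided Mann-Whitney U test: are predictions for resistant tumors larger?

    Hypothetical helper; returns None when the drug fails the inclusion rule.
    """
    resistant_cdr = ["clinical progressive disease", "stable disease"]
    sensitive_cdr = ["partial response", "complete response"]
    pred = np.asarray(pred_log_ic50, dtype=float)
    labels = np.asarray(cdr_labels)
    resistant = pred[np.isin(labels, resistant_cdr)]
    sensitive = pred[np.isin(labels, sensitive_cdr)]
    # Inclusion rule from the text: >= 2 tumors per group and >= 8 in total.
    if min(len(resistant), len(sensitive)) < 2 or len(resistant) + len(sensitive) < 8:
        return None
    # alternative="greater" tests whether the first sample (resistant) tends to be larger.
    return mannwhitneyu(resistant, sensitive, alternative="greater").pvalue


# Toy data (made up for illustration): four resistant and four sensitive tumors.
pred = [2.1, 1.8, 2.5, 1.9, 0.2, 0.5, -0.1, 0.9]
cdr = ["stable disease", "clinical progressive disease",
       "stable disease", "clinical progressive disease",
       "partial response", "complete response",
       "partial response", "complete response"]
p_value = resistant_vs_sensitive_pvalue(pred, cdr)
print(p_value)
```

The test only uses the ranks of the predicted values, which is why it can compare continuous predictions against categorical clinical labels.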

The second column shows the properties of the algorithm (linear versus nonlinear; single-task versus multi-task learning). The third column shows the number of drugs for which a statistically significant discrimination between resistant and sensitive patients was obtained (one-sided Mann-Whitney U test). The fourth column shows the total number of drugs included in the evaluation, and the fifth column shows the combined p-value (using Fisher’s method) for all the drugs in the analysis.

Algorithm | Properties | Drugs with P<0.05 | Drugs | Combined P (Fisher) |
---|---|---|---|---|
LASSO | Linear, Single task | 5 | 12 | 5.21E-09 |
ElasticNet | Linear, Single task | 5 | 12 | 1.18E-08 |
MTL-LASSO | Linear, Multi-task | 5 | 12 | 3.64E-06 |
Ridge | Linear, Single task | 4 | 12 | 1.75E-05 |
MTL-ElasticNet | Linear, Multi-task | 3 | 12 | 4.83E-06 |
SVR (Linear Kernel) | Linear, Single task | 3 | 12 | 1.10E-05 |
SVR (Polynomial Kernel) | Nonlinear, Single task | 3 | 12 | 1.82E-05 |
SVR (RBF Kernel) | Nonlinear, Single task | 3 | 12 | 2.92E-05 |
K-Nearest Neighbor | Nonlinear, Single task | 3 | 12 | 8.26E-05 |
Multi-Layer Perceptron | Nonlinear, Single task | 2 | 12 | 4.86E-02 |
Gelhar, et al. (2017) | Linear, Single task | 1 | 11 | 1.85E-02 |
Random Forest | Nonlinear, Single task | 1 | 12 | 0.19 |

The second column shows the p-value (one-sided Mann-Whitney U test) for the predicted log(IC50) values of sensitive and resistant tumors. The third and fourth columns show the number of resistant and sensitive tumors used in the statistical test.

Drug | P-value (one-sided) | Num Resistant (PD or SD) | Num Sensitive (CR or PR) |
---|---|---|---|
bicalutamide | 0.34 | 3 | 14 |
bleomycin | 0.10 | 4 | 46 |
cisplatin | 6.67E-05 | 25 | 111 |
docetaxel | 0.98 | 17 | 55 |
doxorubicin | 3.42E-03 | 7 | 54 |
etoposide | 7.57E-04 | 10 | 71 |
gemcitabine | 0.14 | 43 | 37 |
paclitaxel | 0.62 | 28 | 74 |
sorafenib | 0.19 | 13 | 2 |
tamoxifen | 8.82E-03 | 4 | 14 |
temozolomide | 9.08E-02 | 84 | 11 |
vinorelbine | 2.10E-03 | 6 | 23 |
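As a worked illustration of how per-drug p-values are combined with Fisher’s method, the sketch below takes the twelve one-sided p-values listed in the table above, counts those below 0.05, and computes the combined p-value. The function name `fisher_combined_pvalue` is hypothetical, and the calculation assumes the per-drug tests are independent. It relies on the closed form of the chi-square survival function for an even number of degrees of freedom, so only the standard library is needed.

```python
import math

# One-sided p-values copied from the table above (one per drug).
p_values = {
    "bicalutamide": 0.34, "bleomycin": 0.10, "cisplatin": 6.67e-05,
    "docetaxel": 0.98, "doxorubicin": 3.42e-03, "etoposide": 7.57e-04,
    "gemcitabine": 0.14, "paclitaxel": 0.62, "sorafenib": 0.19,
    "tamoxifen": 8.82e-03, "temozolomide": 9.08e-02, "vinorelbine": 2.10e-03,
}


def fisher_combined_pvalue(pvals):
    """Fisher's method: X = -2 * sum(ln p) follows a chi-square with 2k d.o.f."""
    pvals = list(pvals)
    k = len(pvals)
    half_stat = -sum(math.log(p) for p in pvals)  # this is X / 2
    # For 2k degrees of freedom the survival function has a closed form:
    # P(chi2 > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


n_significant = sum(p < 0.05 for p in p_values.values())
combined_p = fisher_combined_pvalue(p_values.values())
print(n_significant, combined_p)
```

Because the tabulated p-values are rounded, the combined value computed this way can differ slightly from a value computed on unrounded inputs.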

These results suggest several important points. First, consistent with the reports in [

The predicted log(IC50) values were labeled as resistant or sensitive based on the threshold that obtained the highest oddsratio [

| | Predicted Resistant | Predicted Sensitive | Total |
|---|---|---|---|
| True Resistant | 23 | 2 | 25 |
| True Sensitive | 58 | 53 | 111 |
| Total | 81 | 55 | 136 |
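To make the confusion matrix above easier to interpret, the following sketch derives the odds ratio and a few per-class rates from its four cells. The variable names are hypothetical and the choice of derived quantities is an assumption made for illustration; only the counts come from the table.

```python
# Counts copied from the confusion matrix above.
true_res_pred_res, true_res_pred_sen = 23, 2
true_sen_pred_res, true_sen_pred_sen = 58, 53

# Odds ratio of the 2x2 table: (a * d) / (b * c).
odds_ratio = (true_res_pred_res * true_sen_pred_sen) / (
    true_res_pred_sen * true_sen_pred_res
)

# Fraction of truly resistant tumors that were predicted resistant.
recall_resistant = true_res_pred_res / (true_res_pred_res + true_res_pred_sen)

# Fraction of truly sensitive tumors that were predicted sensitive.
recall_sensitive = true_sen_pred_sen / (true_sen_pred_res + true_sen_pred_sen)

# Fraction of tumors predicted sensitive that were truly sensitive.
precision_sensitive = true_sen_pred_sen / (true_res_pred_sen + true_sen_pred_sen)

print(round(odds_ratio, 2), recall_resistant,
      round(recall_sensitive, 3), round(precision_sensitive, 3))
```

Read this way, the table shows that nearly all resistant tumors are flagged as resistant, at the cost of also flagging about half of the sensitive tumors.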

Various studies have suggested that including information on the interaction of the genes (and their protein products) or their involvement in different pathways can improve the accuracy of different bioinformatics tasks [

The third column shows the number of drugs for which a statistically significant discrimination between resistant and sensitive patients was obtained (one-sided Mann-Whitney U test). The fourth column shows the total number of drugs included in the evaluation, and the fifth column shows the combined p-value (using Fisher’s method) for all the drugs in the analysis. As a point of comparison, LASSO without the use of any network yielded a p-value < 0.05 for five of 12 drugs, with a combined p-value of 5.21E-09 (

Network | Algorithm | Drugs with P<0.05 | Drugs | Combined P (Fisher) |
---|---|---|---|---|
STRING PPI | NICK | 5 | 12 | 1.79E-06 |
STRING PPI | GELnet | 5 | 12 | 1.21E-04 |
STRING PPI | SGL | 3 | 12 | 1.75E-05 |
STRING PPI | ssGSEA-LASSO | 3 | 12 | 6.94E-06 |
STRING Co-Expression | NICK | 5 | 12 | 2.40E-06 |
STRING Co-Expression | GELnet | 5 | 12 | 1.07E-04 |
STRING Co-Expression | SGL | 2 | 12 | 2.62E-04 |
STRING Co-Expression | ssGSEA-LASSO | 4 | 12 | 1.40E-02 |
STRING Text Mining | NICK | 5 | 12 | 2.14E-06 |
STRING Text Mining | GELnet | 5 | 12 | 1.23E-04 |
STRING Text Mining | SGL | 3 | 12 | 7.95E-05 |
STRING Text Mining | ssGSEA-LASSO | 3 | 12 | 2.57E-02 |
HumanNet Integrated Network | NICK | 5 | 12 | 1.09E-06 |
HumanNet Integrated Network | GELnet | 5 | 12 | 1.21E-04 |
HumanNet Integrated Network | SGL | 3 | 12 | 7.69E-04 |
HumanNet Integrated Network | ssGSEA-LASSO | 4 | 12 | 6.06E-07 |

Up to this point, we only used the tissue of origin of the preclinical and clinical samples

One of the most common methods of including the tissue of origin in regression analysis is introducing new binary features to each sample, representing whether the sample belongs to that tissue (‘1’) or not (‘0’) [

To overcome these issues, while explicitly incorporating information on the samples’ tissue of origin, we devised a new approach called

The steps of TG-LASSO are depicted for one tissue type. These steps are repeated for each tissue type. To predict the drug response of tumors corresponding to tissue t, the cell lines of the same lineage are identified (A). These cell lines are used as the validation set, while cell lines of all other lineages are used as the training set for hyperparameter tuning (B). The identified hyperparameter is used to train a tissue-dependent model using all the CCLs (C). The trained model is used to predict the drug response of tumors from tissue t (D). Since the hyperparameter is tuned in a tissue-dependent manner (B and C), the models trained for each tissue type are distinct (C and D).
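The four steps in this caption can be sketched as follows for a single drug and a single tissue. This is a minimal illustration, not the authors’ implementation: the function name `tg_lasso_predict`, the candidate grid `alphas`, the use of scikit-learn’s `Lasso`, and the choice of mean squared error on the validation cell lines as the tuning criterion are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso


def tg_lasso_predict(X_ccl, y_ccl, ccl_tissue, X_tumor, tissue, alphas):
    """Tissue-guided hyperparameter tuning followed by training on all CCLs."""
    X_ccl = np.asarray(X_ccl, dtype=float)
    y_ccl = np.asarray(y_ccl, dtype=float)
    # (A) Identify cell lines that share the lineage of the tumors to be predicted.
    in_tissue = np.asarray(ccl_tissue) == tissue
    if not in_tissue.any() or in_tissue.all():
        raise ValueError("need cell lines both inside and outside the tissue")
    # (B) Tune the hyperparameter: train on other lineages, validate on this one.
    best_alpha, best_err = None, np.inf
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10000)
        model.fit(X_ccl[~in_tissue], y_ccl[~in_tissue])
        err = np.mean((model.predict(X_ccl[in_tissue]) - y_ccl[in_tissue]) ** 2)
        if err < best_err:  # assumed criterion: validation mean squared error
            best_alpha, best_err = alpha, err
    # (C) Train the tissue-dependent model on all cell lines with the chosen value.
    final_model = Lasso(alpha=best_alpha, max_iter=10000).fit(X_ccl, y_ccl)
    # (D) Predict log(IC50) for tumors from this tissue.
    return final_model.predict(np.asarray(X_tumor, dtype=float)), best_alpha


# Toy data (random, for illustration only): 60 cell lines, 20 genes, 3 tissues.
rng = np.random.default_rng(0)
X_ccl = rng.normal(size=(60, 20))
true_coef = np.zeros(20)
true_coef[:3] = [1.5, -2.0, 1.0]
y_ccl = X_ccl @ true_coef + rng.normal(scale=0.1, size=60)
ccl_tissue = np.repeat(["lung", "breast", "skin"], 20)
X_tumor = rng.normal(size=(5, 20))
alphas = [0.01, 0.1, 1.0]

pred, chosen_alpha = tg_lasso_predict(X_ccl, y_ccl, ccl_tissue, X_tumor, "lung", alphas)
print(pred.shape, chosen_alpha)
```

Calling the function once per tissue yields a separate hyperparameter, and therefore a separate model, for each tissue, even though every model is trained on the same full set of cell lines.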

This approach resulted in the best performance among all the methods tested, with 7 (out of 12) drugs showing significant discrimination between resistant and sensitive tumors (p<0.05) and a combined p-value (Fisher’s method for all 12 drugs) of 2.25E-10 (

A) The box plots reflect the distribution of estimated log(IC50) values using TG-LASSO for each group of resistant or sensitive patients. The p-values correspond to a one-sided Mann-Whitney U test. The Precision@20% (written as P@20%) is the precision of the method when samples with predicted log(IC50) above the 80th percentile of the training log(IC50) values are declared as resistant and those below the 20th percentile are declared as sensitive. B) The Precision@k% as a function of k.

The second column shows which subset of the training samples was used for training. The third column shows how tissue information was used. The fourth column shows the number of drugs for which a statistically significant discrimination between resistant and sensitive patients was obtained (one-sided Mann-Whitney U test). The fifth column shows the total number of drugs included in the evaluation, and the sixth column shows the combined p-value (using Fisher’s method) for all the drugs in the analysis.

Algorithm | Training Samples | Tissue information | Drugs with P<0.05 | Drugs | Combined P (Fisher) |
---|---|---|---|---|---|
TG-LASSO | All samples | Used during hyperparameter tuning | 7 | 12 | 2.25E-10 |
Method 1 | All samples | Used as new binary features | 5 | 12 | 5.21E-09 |
Method 2 | Only samples matching the test samples’ tissue | Used to identify relevant training samples | 1 | 12 | 0.16 |
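The two alternative ways of using tissue information listed in the table can be illustrated with a short sketch. The helper names `add_tissue_indicators` and `select_matching_tissue`, as well as the toy arrays, are hypothetical; the sketch only shows the data manipulation implied by the table (binary tissue features for Method 1, tissue-matched training samples for Method 2), not the regression step.

```python
import numpy as np


def add_tissue_indicators(X, tissues, tissue_order):
    """Method 1 style: append one binary column per tissue (1 if the sample belongs to it)."""
    X = np.asarray(X, dtype=float)
    tissues = np.asarray(tissues)
    indicators = np.column_stack(
        [(tissues == t).astype(float) for t in tissue_order]
    )
    return np.hstack([X, indicators])


def select_matching_tissue(X, y, tissues, tissue):
    """Method 2 style: keep only training samples whose tissue matches the test tissue."""
    mask = np.asarray(tissues) == tissue
    return np.asarray(X, dtype=float)[mask], np.asarray(y, dtype=float)[mask]


# Toy data (made up): three cell lines, two genes.
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([0.5, 1.5, 2.5])
tissues = ["lung", "breast", "lung"]

X_aug = add_tissue_indicators(X, tissues, ["breast", "lung"])
X_lung, y_lung = select_matching_tissue(X, y, tissues, "lung")
print(X_aug)
print(X_lung, y_lung)
```

The sketch makes the trade-off visible: the first option keeps all samples but widens the feature matrix, while the second keeps the features but can shrink the training set substantially.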

To better assess the ability of TG-LASSO to predict whether a drug should be administered to a patient, we defined a measure which we called Precision@k%. Intuitively, this measure represents the precision of the method when test samples with predicted log(IC50) above the (100 − k)th percentile of the training samples’ log(IC50) values are labeled as resistant and those below the kth percentile are labeled as sensitive (see […] 20th and 80th percentiles of the training samples’ log(IC50) values may be good thresholds for deciding whether a patient is sensitive or resistant to these drugs.

One interesting observation was that paclitaxel, the response of which could not be predicted accurately with the majority of methods reported in

Since some of the drugs used in our study were administered in combination with other drugs, we asked how well TG-LASSO predicts the CDR in such cases of treatment with drug combinations. For this purpose, we evaluated its CDR prediction for a drug only on patients for whom that drug was administered over a period overlapping their treatment with at least one other drug. We limited our analysis to 9 drugs with at least two samples (patients) in each group (sensitive and resistant) and with at least 8 samples in total. Supplementary

Next, we sought to evaluate the effect of batch-effect removal and preprocessing on the performance of TG-LASSO. For this purpose, we did not perform ComBat data homogenization or z-score normalization on the gene expression data. As expected, the performance of both TG-LASSO and LASSO deteriorated, with the former resulting in 4 drugs with p < 0.05 and the latter with only 3 (Supplementary

During its training phase, TG-LASSO automatically selects a subset of genes to be used in the regression model by tuning the hyperparameter

More importantly, the knockdown or overexpression of many of the identified genes has been shown to influence the sensitivity of cancer cells to these drugs. For example, the shRNA knockdown of CHI3L1, a gene that TG-LASSO identified for etoposide and cisplatin response in every tissue (but that LASSO did not identify for either of these drugs), has been shown to sensitize glioma cells to these two drugs, while its overexpression reduced their sensitivity [

We hypothesized that genes that were identified by TG-LASSO as response predictors of many drugs in a single tissue (Supplementary

Patients were clustered based on the expression of genes that were identified by TG-LASSO for more than 5 drugs in the corresponding tissue. The p-value was calculated using a log-rank test.

Next, we repeated the analysis above using genes identified by LASSO for more than 5 drugs as a benchmark (Supplementary

Since Kaplan-Meier analysis of LGG clusters obtained using TG-LASSO genes resulted in the smallest p-value (log-rank test, p = 7.61E-13), we sought to further characterize the identified genes that resulted in this significant patient stratification using functional and pathway enrichment analysis. For this purpose, we used the KnowEnG’s gene set characterization pipeline [

Several of the most significantly enriched GO terms were related to extracellular matrix (ECM), which plays an important role in the infiltration of glioma cells into the brain [

The enriched pathways included miRNA targets in ECM and membrane receptors (FDR = 2.0E-3) and Syndecan-1-mediated signaling (FDR = 0.04). Syndecan-1 is a cell surface heparan sulfate proteoglycan and its expression has been shown to be correlated with tumor cell differentiation in various cancers [

Ideally, a predictive model of CDR should be trained on data obtained directly from patients. Similarly, identification of biomarkers of drug sensitivity has the most potential clinical impact when based on patient data. In practice, however, most patients only receive the ‘standard of care’ treatment for their specific cancer type, so CDR data are scarce for newly approved drugs and for drugs that have not yet passed clinical trials, which limits our ability to decipher the mechanisms of drug sensitivity for these drugs. An alternative approach is to train ML models on preclinical samples (e.g. CCLs) to predict the CDR of patients, then use these predictions to discover novel biomarkers and druggable targets.

Recent large-scale studies that have cataloged the molecular profiles of thousands of CCLs and their response to hundreds of drugs [

Another factor that played an important role in the performance of the ML models was data homogenization and batch-effect removal. The performance of both TG-LASSO and LASSO deteriorated when we did not remove the batch effect between the training and test datasets. Even so, TG-LASSO could distinguish between resistant and sensitive patients for four drugs when applied to non-homogenized data. This suggests two approaches for scenarios in which new test samples arrive. The first is to simply use the model trained on non-homogenized preclinical samples and accept the worse performance. The second is to retrain the model every time a new test sample arrives, which allows for training and prediction on homogenized data but significantly increases the computational cost. A third option would be to develop a new data homogenization and batch-effect removal method that only transforms the gene expression of the test samples (keeping the gene expression profiles of training samples unchanged) by mapping them to the subspace spanned by the training samples. However, the development of such a method is beyond the scope of this study.

We note that due to the major differences between CCLs and tumors (e.g. the greater heterogeneity of cells in a tumor compared to CCLs), obtaining more accurate results based on classical ML techniques may not be possible. The reason is that classical ML methods assume that the training samples and the test samples are drawn from the same or similar distributions. While batch-effect removal and other homogenization and normalization techniques help to alleviate this issue, more realistic preclinical models of cancer are necessary to significantly improve these results. Recent advances in developing human-derived xenografts [

We obtained the gene expression profiles (FPKM values) of 531 primary tumor samples of TCGA patients who were administered any of the 23 drugs mentioned earlier. First, we removed genes that contained missing values. We also removed any gene that was not expressed (i.e. FPKM<1) for more than 90% of the samples. Then, we performed a log-transformation and obtained log2(FPKM+0.1) values for each gene. The resulting gene expression matrix contained 19,437 genes and 531 samples. We obtained the CDR of these patients from the supplementary files of [

To homogenize the gene expression data from these two datasets, we first removed genes not present in both datasets as well as genes with low variability across all the samples (standard deviation < 0.1), resulting in a total of 13,942 shared genes. Then, we used ComBat [

For the network-guided analyses, we downloaded four networks of gene interactions in humans from the KnowEnG’s knowledgebase of genomic networks [

The baseline models (

For the network-based algorithms (

In addition to the above methods that utilize the graph Laplacian of each network in the regression algorithm, we used sparse group LASSO (SGL). This method takes a collection of pathways as input and induces sparsity at both the pathway and the gene level to generate the input. We performed community detection on each of the networks in

Finally, we developed a heuristic method based on ssGSEA [

In the first approach (Method 1 in

In the second approach (Method 2 in

TG-LASSO is a method for predicting the CDR of tumors using the information in

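During training, LASSO minimizes the objective function (1/(2n))∥y − Xβ∥_{2}^{2} + α∥β∥_{1}, where X is the gene expression matrix of the n training samples, y is the vector of their log(IC50) values, β is the vector of regression coefficients, ∥ ∥_{2} denotes the L2 vector norm, ∥ ∥_{1} denotes the L1 vector norm, and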

Let

To further assess the performance of TG-LASSO, we defined a measure called Precision@k% (motivated by Precision@k in information retrieval). To define Precision@k%, we first used the log(IC50) values of the preclinical cell lines from GDSC to find the Kth percentile (K ≤ 50) and the (100−K)th percentile of each drug (separately), denoted as t_{K} and t_{100−K}, respectively. Then, given the predicted log(IC50) values of the tumors and their annotation as ‘sensitive’ or ‘resistant’ (based on their known CDR), we defined

Precision@K% = (R_{100−K} + S_{K}) / (N_{100−K} + N_{K}),

where R_{100−K} is the number of resistant tumors whose predicted log(IC50) is larger than t_{100−K}, S_{K} is the number of sensitive tumors whose predicted log(IC50) is smaller than t_{K}, N_{100−K} is the total number of tumors whose predicted log(IC50) is larger than t_{100−K}, and N_{K} is the total number of tumors whose predicted log(IC50) is smaller than t_{K}. Intuitively, this measure shows the precision of predicting the tumors with predicted log(IC50) values larger than t_{100−K} as resistant and those with predicted log(IC50) values smaller than t_{K} as sensitive. Note that due to this definition of Precision@k%, for some values of k the denominator may be equal to 0, in which case the measure is not defined.
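The definition above translates directly into a short function. The sketch below is an illustration under stated assumptions: the function name `precision_at_k` and the toy data are hypothetical, and `numpy.percentile` with its default linear interpolation is assumed as the percentile rule, which the text does not specify. The function returns `None` when no tumor falls beyond either threshold, matching the undefined case noted above.

```python
import numpy as np


def precision_at_k(train_log_ic50, pred_log_ic50, is_resistant, k):
    """Precision@k%: precision of calls made beyond the k-th / (100-k)-th percentiles."""
    if not 0 < k <= 50:
        raise ValueError("k must satisfy 0 < k <= 50")
    # Thresholds come from the training (cell line) log(IC50) values of the drug.
    t_low = np.percentile(train_log_ic50, k)
    t_high = np.percentile(train_log_ic50, 100 - k)
    pred = np.asarray(pred_log_ic50, dtype=float)
    resistant = np.asarray(is_resistant, dtype=bool)
    called_resistant = pred > t_high
    called_sensitive = pred < t_low
    n_called = int(called_resistant.sum() + called_sensitive.sum())
    if n_called == 0:
        return None  # denominator is zero, so the measure is undefined
    n_correct = int((called_resistant & resistant).sum()
                    + (called_sensitive & ~resistant).sum())
    return n_correct / n_called


# Toy data (made up): training values 1..100, six tumors with known response.
train = np.arange(1, 101)
pred = [90, 85, 10, 5, 50, 95]
is_resistant = [True, False, False, True, True, True]

p_at_20 = precision_at_k(train, pred, is_resistant, 20)
print(p_at_20)
```

In the toy example, five of the six tumors fall beyond a threshold and three of those calls agree with the known response; the tumor with a mid-range prediction is left uncalled and does not affect the measure.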

We used the gene set characterization pipeline of KnowEnG analytical platform [

An implementation of TG-LASSO in Python, with appropriate documentation and input files, is available at:


Patients were clustered into two groups using the expression of genes identified by TG-LASSO for more than 5 drugs in each tissue type.
