Predicting drug sensitivity of cancer cells based on DNA methylation levels

Cancer cell lines, which are cell cultures derived from tumor samples, represent one of the least expensive and most studied preclinical models for drug development. Accurately predicting drug responses for a given cell line based on molecular features may help to optimize drug-development pipelines and explain mechanisms behind treatment responses. In this study, we focus on DNA methylation profiles as one type of molecular feature that is known to drive tumorigenesis and modulate treatment responses. Using genome-wide DNA methylation profiles from 987 cell lines in the Genomics of Drug Sensitivity in Cancer database, we applied machine-learning algorithms to evaluate the potential to predict cytotoxic responses for eight anti-cancer drugs. We compared the performance of five classification algorithms and four regression algorithms representing diverse methodologies, including tree-, probability-, kernel-, ensemble-, and distance-based approaches. We artificially subsampled the data to varying degrees, aiming to understand whether training based on relatively extreme outcomes would yield improved performance. When using classification or regression algorithms to predict discrete or continuous responses, respectively, we consistently observed excellent predictive performance when the training and test sets consisted of cell-line data. Classification algorithms performed best when we trained the models using cell lines with relatively extreme drug-response values, attaining area-under-the-receiver-operating-characteristic-curve values as high as 0.97. The regression algorithms performed best when we trained the models using the full range of drug-response values, although this depended on the performance metrics we used. Finally, we used patient data from The Cancer Genome Atlas to evaluate the feasibility of classifying clinical responses for human tumors based on models derived from cell lines.
Generally, the algorithms were unable to identify patterns that predicted patient responses reliably; however, predictions by the Random Forests algorithm were significantly correlated with Temozolomide responses for low-grade gliomas.

This approach has some limitations. For example, as few as 2-3 samples were available for some of the cell types. We now address this issue in the Discussion. Regardless, as shown in our updated manuscript, the results were similar using this approach.
(3) The GDSC cell lines were measured at multiple omic levels: DNA, mRNA, and protein. In the Introduction section, clearer motivation for evaluating only DNA methylation is required.
Reply: Thank you for this suggestion. We now provide more justification for this in the manuscript.
(4) A number of classification algorithms for discretized drug response, and regression approaches for continuous IC50 values, were evaluated in terms of predictive accuracy. However, a comparison between the two types of approaches seems to be missing.
Reply: Thank you for this suggestion. We have used the Spearman correlation coefficient as a way to assess the concordance between the predicted probabilities (classification algorithms) and predicted IC50 values (regression algorithms). We have added a graph that illustrates the similarities and differences between these types of predictions.
(5) A detailed description of the feature selection should be included. The CTGF gene, among others, was highlighted in the paper. Did these genes appear using all the algorithms? What was the estimated prediction accuracy based on the algorithms? Did these genes show high predictive power for all the drugs and all the cell-line lineages?
Reply: We performed feature selection in a post hoc analysis after we performed classification. Accordingly, we did not compare the ability to predict drug responses with and without feature selection. We had multiple reasons for this: (1) we were concerned about overfitting the training data, which we used to tune hyperparameters for the classification algorithms, and (2) additional tuning would have been required to select the best n features to use for each classification scenario. To clarify these points, we have changed the terminology that we use in the paper from "feature selection" to "feature ranking." In addition, we have clarified in the manuscript that we used information gain to rank the features, have provided an explanation of the intuition behind information gain, and have clarified that the feature-ranking analysis was performed post hoc. The Supplementary Material provides the feature ranks and scores for all of the drugs and algorithms.
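For readers unfamiliar with information gain, the intuition can be sketched in a few lines of Python. This is an illustrative toy (the function names and example arrays are ours, not part of the manuscript's pipeline): information gain is the reduction in label entropy achieved by splitting the samples on a feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a discrete label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = entropy(labels)
    weighted = 0.0
    for value in np.unique(feature):
        mask = feature == value
        weighted += mask.mean() * entropy(labels[mask])
    return total - weighted

# A perfectly informative feature recovers the full label entropy;
# an uninformative one yields a gain of zero.
y = np.array([0, 0, 1, 1])         # e.g., low vs. high IC50
perfect = np.array([0, 0, 1, 1])   # splits the classes exactly
useless = np.array([0, 1, 0, 1])   # independent of the classes
print(information_gain(perfect, y))  # → 1.0
print(information_gain(useless, y))  # → 0.0
```

In practice, continuous methylation beta values would first be discretized (or a threshold-based variant of the score would be used) before computing the gain.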
(6) For patients' drug response, the four categories response call was used. When you evaluate the algorithms on the patient data, what was the response call for the cell lines based on IC50? And how do you train the models in the cell lines data? Greater detail is required.
Reply: Thank you for letting us know that this description was not clear enough. We now provide more details. After training the classification algorithms on the cell-line dataset to discriminate between low and high IC50 responses, we used the algorithms to generate probabilistic predictions for each tumor sample. We then evaluated whether the probabilistic predictions of treatment response were correlated with the reported responses for each patient, as a way to assess whether higher probabilities of response (based on cell-line observations) correlated with better patient responses. The latter approach is slightly different from what we described in the original manuscript. We modified our methodology based on a recommendation from one of the other reviewers (see below).
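The train-on-cell-lines, predict-on-tumors workflow described above can be sketched as follows. This is a hypothetical illustration using random data and default scikit-learn settings; the actual analysis used real methylation features, tuned hyperparameters, and multiple algorithms.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_cell_lines = rng.random((100, 20))       # methylation features (cell lines)
y_cell_lines = rng.integers(0, 2, 100)     # low (0) vs. high (1) IC50 class
X_tumors = rng.random((15, 20))            # same features, tumor samples

# Train on cell lines only; tumors are never seen during training.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_cell_lines, y_cell_lines)

# Column 1 holds the predicted probability of the "high IC50" class.
# These probabilities are then compared against the ordinal clinical
# responses (e.g., with Spearman's rank correlation).
probs = model.predict_proba(X_tumors)[:, 1]
```

The key design point is that the model emits calibrated-ish probabilities rather than hard labels, so patient responses can be compared against a continuous score.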

Response to Reviewer #3:
(1) The introduction lacks an appropriate literature review. It is better to review some papers that have used machine learning methods for predicting drug response. Moreover, a summary is required of other features that have been used, such as mutation and copy number variation, and of the papers that have used a combination of several feature types.

Reply:
We thank the reviewer for this suggestion. We have expanded our coverage of existing literature regarding the use of machine-learning algorithms for predicting drug sensitivity. We now include the following paragraph in the Introduction section: "Many computational methods have been proposed to predict anticancer drug sensitivity based on genetic, genomic, or epigenomic features of cancer cell lines. The most common approach is to generate a drug-specific model, which is independently trained using molecular observations and drug-response data from cell lines tested with each drug individually. Linear-regression-based, drug-specific models have been developed using gene expression data (Barretina et al., 2012; Geeleher et al., 2014; Iorio et al., 2016) or a combination of gene expression data and other genomic data types, such as copy number alterations and DNA methylation (Chen and Sun, 2017). Non-linear models using a single data type or multiple data types have also been proposed, including artificial neural networks, random forests, support vector machines, kernel regression, latent and Bayesian approaches, attractor landscape analysis of network dynamics, unsupervised pathway activity models, and recommender systems (Costello et al., 2014; Dong et al., 2015; Zhang et al., 2015; Ammad-ud-din et al., 2016; Cortés-Ciriano et al., 2016; Gupta et al., 2016; Ammad-ud-din et al., 2017; Choi et al., 2017; Rahman et al., 2017; Ali et al., 2018; Chang et al., 2018; Dhruba et al., 2018; Ding et al., 2018; Huang et al., 2018; Suphavilai et al., 2018; Wang et al., 2019; Xu et al., 2019; Emdadi et al., 2020). Transfer-learning techniques have also been proposed to improve drug-response prediction performance for one type of cancer by incorporating data from other types of cancer (Turki et al., 2017).
Drug response information has also been modeled in combination with chemical drug properties using elastic net regression, support vector machines, regularized matrix factorization, and manifold learning (Menden et al., 2013; Yuan et al., 2016; Wang et al., 2017; Moughari et al., 2020; Su et al., 2020)."

(2) The method used for feature selection and for identifying essential genes is not described well. A brief description of, and intuition about, the method used in this section is required. Relying solely on naming the packages used is not sufficient, and readers may have no clue what is going on in this analysis.

Reply:
We have added more details in the Methods section about the feature-ranking approach that we used, including a description of the intuition behind information gain.
(3) The authors have conducted two biological analyses: (1) finding important genes and (2) predicting drug response for TCGA tumors. However, neither analysis yields significant results. The genes found to be important for predicting anticancer drug response do not play a significant role in cancer processes. Moreover, both the regression and classification methods were unable to predict drug response for TCGA tumors. Hence, the biological analysis does not confirm model competence. A more significant biological analysis is required.
Reply: We agree that our methodology did not result in strong predictive performance on the tumor data in most cases or in extensive insights into the biology of drug responses (based on the top-ranked genes and subsequent functional analyses). However, our study does provide valid scientific insights, including about subsampling, which was the main focus of our study. Furthermore, even negative results provide value to the scientific community. As the PLOS ONE policy about negative results states, "In keeping with our mission to publish all valid research, we consider negative and null results."

(4) The authors do not make a significant methodological contribution. They analyze off-the-shelf machine learning methods in various scenarios.

Reply:
The main focus of this research was to study the effect of subsampling on model performance rather than to contribute new algorithms to the field. We evaluated the subsampling strategy using traditional machine learning algorithms as a way to estimate the overall potential to predict drug responses from methylation-based cell-line data. Future studies could explore the potential to improve predictive performance with alternative algorithms or data types.
(5) The authors have used DNA methylation data instead of gene expression because gene expression data are not stable and are measured using various technologies. It is required to use another type of feature with the same model to determine the effectiveness of the DNA methylation features.
Reply: The main goal of this research was not to compare the predictive accuracy of multiple data types but rather to estimate the overall potential to predict drug responses from methylation-based cell-line data and to evaluate the effects of data subsampling.
(6) The Enrichment of informative genes in GSEA cancer-related gene families can better illustrate the influence of informative genes in cancer-related procedures. Moreover, finding protein complexes related to these genes using CORUM and iRefWeb websites may reveal more information about the drug mechanism.
Reply: We thank the reviewer for these suggestions. We modified the enrichment analysis to use gene sets (including many that are cancer related) from the Molecular Signatures Database (also used by GSEA) and its gene-set overlap method. This provided much more relevant results than the previous analysis, including insights about drug resistance.

(7) The regression analysis used to show the correlation of predicted continuous probabilities with ordinal TCGA drug response labels seems inappropriate. The authors could use hypothesis tests that are specifically designed for ordinal variables.
Reply: Thank you for this suggestion. After looking at this again, we realized that our approach was flawed. We have modified our approach to use the Spearman rank correlation coefficient as a measure of the similarity between the predicted probabilities and the ordinal responses for TCGA patients. Spearman's method is applicable to ordinal values (see Lehman, Ann (2005). JMP for Basic Univariate and Multivariate Statistics: A Step-by-Step Guide. Cary, NC: SAS Press. p. 123. ISBN 978-1-59047-576-8). We calculated p-values for these comparisons and adjusted for multiple tests.
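Spearman's rho is simply the Pearson correlation of rank-transformed values, which is what makes it applicable to ordinal data. A minimal sketch in Python (the helper names, tie handling via mean ranks, and the example response coding are ours, for illustration only):

```python
import numpy as np

def rank(values):
    """Average ranks (1-based); tied values share the mean rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1  # mean of 1-based positions
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    rx, ry = rank(x), rank(y)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(np.sum(rx * ry) / np.sqrt(np.sum(rx**2) * np.sum(ry**2)))

# Hypothetical ordinal responses coded 1-4 (1 = progressive disease,
# 4 = complete response) against predicted probabilities of sensitivity.
responses = [1, 2, 2, 3, 4, 4]
probabilities = [0.10, 0.25, 0.30, 0.55, 0.80, 0.90]
rho = spearman(probabilities, responses)
```

Because only the ranks matter, any monotone transformation of the predicted probabilities leaves rho unchanged, which is exactly the property needed when comparing against ordinal labels.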

(8) Moreover, it is better to apply a pre-processing method to remove batch effects.
Reply: Thank you for this suggestion. In response to this comment, we have performed a type of batch adjustment. Although batch information was not available at a granular level, we used data source as a batch variable and performed batch-effect adjustment using an Empirical Bayes approach. We also included cell type as a covariate to help mitigate the effects of that variable (per a suggestion from another reviewer). We repeated the entire analysis using these data and have updated these results in the manuscript. Our conclusions remain essentially identical; however, we agree that this resulted in a less-biased dataset.
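The Empirical Bayes adjustment described above corresponds to the ComBat family of methods. As a rough intuition only, a greatly simplified location/scale version can be sketched as below; this toy omits the Empirical Bayes shrinkage of batch parameters and the covariate modeling (e.g., cell type) that ComBat performs, and all names and data are illustrative.

```python
import numpy as np

def simple_batch_adjust(X, batches):
    """Per-feature location/scale standardization within each batch.

    A simplified stand-in for ComBat-style adjustment: each batch is
    centered and scaled to the grand mean and grand standard deviation,
    removing additive and multiplicative batch effects.
    """
    X = np.asarray(X, dtype=float)
    batches = np.asarray(batches)
    adjusted = np.empty_like(X)
    grand_mean = X.mean(axis=0)
    grand_std = X.std(axis=0, ddof=1)
    for b in np.unique(batches):
        mask = batches == b
        mu = X[mask].mean(axis=0)
        sd = X[mask].std(axis=0, ddof=1)
        sd[sd == 0] = 1.0  # guard against constant features within a batch
        adjusted[mask] = (X[mask] - mu) / sd * grand_std + grand_mean
    return adjusted

# Two hypothetical data sources (the batch variable) with shifted means.
X = np.array([[1.0], [2.0], [3.0], [11.0], [12.0], [13.0]])
batches = ["source1", "source1", "source1", "source2", "source2", "source2"]
adjusted = simple_batch_adjust(X, batches)
```

After adjustment, the per-batch means coincide, so a downstream model can no longer exploit data source as a shortcut feature.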
(9) The authors have conducted their analyses on only eight specific drugs. Therefore, the analyses are not comprehensive and may have biased results. It is better to incorporate many more drugs into the analyses or to select eight drugs at random.

Reply:
We selected these drugs based on the amount of available data. With the exception of Gefitinib, these drugs were those that overlapped most between the GDSC and TCGA datasets. We included Gefitinib in our final analysis because it is well known and because we used it in our initial testing. However, we selected the remaining drugs based on data availability. Although it would be ideal to use additional drugs and select them at random, we concluded that selecting drugs based on sample size would make the best use of available data.

(10) The computed metrics for the regression methods are not very diverse. RMSE is just the square root of MSE, so assessing MSE beside RMSE does not convey additional information. Moreover, MAE offers almost the same perspective for model evaluation. It is better to compute more diverse metrics, such as the Pearson correlation coefficient or R^2.
Reply: Thank you for this suggestion. We have excluded MSE and included R^2 and the Spearman rank correlation coefficient in the regression analyses.
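For reference, the retained error and fit metrics can be computed as follows (an illustrative helper with hypothetical inputs, not the evaluation code used in the manuscript):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R^2 for predicted vs. observed (e.g., log-IC50) values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    rmse = np.sqrt(np.mean(resid ** 2))       # penalizes large errors more
    mae = np.mean(np.abs(resid))              # robust average error
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                # fraction of variance explained
    return rmse, mae, r2

# Hypothetical observed and predicted values.
rmse, mae, r2 = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.2, 1.9, 3.3, 4.1])
```

Note that R^2 (and rank correlation) captures how well predictions track the spread of the observed values, which pure error averages like RMSE and MAE do not.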

Response to Reviewer #4:
(1) (2) Figures, in general, could benefit from improved quality, mainly higher resolution (Figures 2 and 3). Figure 4 needs bigger markers and fonts. The figures would also benefit from a common style in terms of font type, size, and color scheme.
Reply: In our revised manuscript, we have provided high-resolution (TIFF) versions of our figures. In addition, we have modified Figure 4 (now Figure 5) so that it is taller and thus easier to read.

(3) Regarding the regression analysis, it would be very interesting to also see the correlation values reported, to get a better idea of the trend of the predictions beyond the average errors (MSE, RMSE, and MAE).
Reply: Thank you for this suggestion. We have excluded MSE (as RMSE is the square root of the MSE) and have included R^2 and the Spearman rank correlation coefficient in the regression analyses.
(4) This might be a bit beyond the scope of the paper, but as shown, in Costello et al. (https://www.nature.com/articles/nbt.2877.pdf), the expression values (transcriptomic) seem to be more predictive compared to other omic features for the different predictive models considered. I think it would be interesting, if it is not too much work, to repeat the translational analysis with transcriptomic data, to see whether the poor performance observed at methylomic level are consistent or change with a different data type.

Reply:
We agree that comparing our results against those obtained using transcriptomic data would be interesting; we thank the reviewer for this suggestion. One reason we used methylation data is that the data were collected using the same technology (Illumina HumanMethylation 450K arrays) for GDSC as well as TCGA. However, GDSC used a relatively obscure type of gene-expression microarray (Affymetrix U219), whereas TCGA used RNA-Sequencing. Although these types of data can be combined for some types of analyses, we concluded that it would be challenging to correct the datasets properly for these platform effects. In other words, the machine-learning algorithms could easily be "distracted" by these platform differences. Furthermore, the main focus of this work was to study the effects of subsampling on model performance, and we were able to address that question using methylation data.

Response to Reviewer #5:
(1) Page 3 Line 11: Different platforms and preprocessing algorithms were considered challenging points in combining gene expression datasets. I believe this is also the case in general for combining methylation datasets, even though GDSC and TCGA used the same methylation profiling assays.

Reply:
We thank the reviewer for this observation. We agree that biases can arise even when two datasets were generated using the same platform. However, one reason we chose methylation data was that the data had been preprocessed using the same methodology for GDSC and TCGA. In contrast, the gene-expression data were generated using gene-expression microarrays (Affymetrix U219) for GDSC and RNA-Sequencing for TCGA, resulting in much more pronounced differences between the platforms. In addition, in response to a comment from one of the other reviewers, we have performed a type of batch adjustment in which we corrected for differences between the datasets (as well as cell type). Our updated manuscript includes these results, which are qualitatively similar to the previous ones.
(2) Page 11 Line 3-6: Performances of the various classification algorithms used in this study were well summarized in Table 2.

Reply: Thank you for your comment. Perhaps we misunderstand the reviewer's comment, but we did not have a way to calculate a mean or standard deviation for the probabilities because we trained only once (per drug) and then made one prediction per patient. Each algorithm produces a single set of probabilistic predictions per patient. One of the other reviewers suggested that we use an approach for evaluating the probabilities that accounts for the ordinal nature of the patient drug responses. We modified this part of the analysis to use the Spearman rank correlation coefficient.

(5)

Reply:
We thank the reviewer for highlighting that subsampling was the primary focus of this study. In response to this comment and comments from other reviewers, we have modified the Discussion section to address the differences we observed between classification and regression in more depth. We have also added an analysis in which we compare the classification results directly against the regression results using a common metric.