A computational method for drug sensitivity prediction of cancer cell lines based on various molecular information

Determining the drugs to which a patient is sensitive is one of the most critical problems in precision medicine. Using genomic profiles of the tumor together with drug information can help tailor the most effective treatment for a patient. In this paper, we propose a classification machine learning approach that predicts the sensitive/resistant drugs for a cell line. It can be run using both drug and cell line similarities, using only one of the cell line or drug similarities, or without any similarity information. This paper investigates the influence of previously defined as well as two newly introduced similarities on predicting anti-cancer drug sensitivity. The proposed method uses max concentration thresholds to assign drug responses to class labels. Its performance was evaluated using stratified five-fold cross-validation on cell line-drug pairs in two datasets. Assessing the predictive power of the proposed model against three sets of methods, including state-of-the-art classification methods, state-of-the-art regression methods, and off-the-shelf classification machine learning approaches, shows that the proposed method outperforms the others. Moreover, the efficiency of the model was evaluated in tissue-specific conditions. In addition, the novel sensitive associations predicted by this model were verified by several pieces of supporting evidence in the literature and reliable databases. Therefore, the proposed model can efficiently be used in predicting anti-cancer drug sensitivity. Materials and implementation are available at https://github.com/fahmadimoughari/CDSML.


1 The detailed formulae of manifold learning in CDSML
Manifold learning tries to decompose an m×n matrix B into latent matrices X (m×k) and Y (n×k), such that B ≈ XY^T. Therefore, we aim to minimize the difference between B and XY^T. The initial loss function is defined in Eq. 1.
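The equation itself is not reproduced here; based on the description above, Eq. 1 plausibly takes the standard matrix factorization form (a reconstruction, not copied from the paper):

```latex
\mathrm{Loss} = \left\lVert B - XY^{T} \right\rVert_F^{2}
```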
Moreover, it is not favorable for the latent matrices to have large norms, because that puts the model at risk of high variance. Therefore, we also aim to minimize the norms of the latent matrices. The improved version of the loss function is defined in Eq. 2.
Furthermore, it is desirable to learn the latent matrices such that they preserve the manifold properties of the samples. More precisely, we aim to find latent matrices such that, for any pair of cell lines c_i and c_j, the distance between c_i and c_j can be estimated by the Euclidean distance of their latent vectors. In other words, for pairs c_i and c_j with high similarity, ||X(i) − X(j)||^2 should be small. A similar constraint should be satisfied for drugs. Therefore, the final loss function is defined in Eq. 3.
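Combining the three ingredients described above (reconstruction error, norm penalties, and similarity-weighted distances between latent vectors), the final loss of Eq. 3 plausibly has the following form; this is a hedged reconstruction from the surrounding text, with SC and SD denoting the cell line and drug similarity matrices:

```latex
\mathrm{Loss} = \left\lVert B - XY^{T} \right\rVert_F^{2}
 + \alpha \left( \lVert X \rVert_F^{2} + \lVert Y \rVert_F^{2} \right)
 + \beta \Big( \sum_{i,j} SC_{ij}\, \lVert X(i) - X(j) \rVert^{2}
             + \sum_{i,j} SD_{ij}\, \lVert Y(i) - Y(j) \rVert^{2} \Big)
```

As a standard equivalence, each similarity term can also be written with a graph Laplacian, e.g. \(\sum_{i,j} SC_{ij} \lVert X(i)-X(j)\rVert^{2} = 2\,\mathrm{tr}(X^{T} L_{SC} X)\) with \(L_{SC} = D_{SC} - SC\).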
where α and β are the regularization and similarity conservation coefficients. The two latent matrices X and Y are updated iteratively using Newton's method to minimize the loss function. X^(0) and Y^(0) were initialized randomly, and afterwards X^(k) and Y^(k) were updated using the rules defined in Formulae 4 and 5, respectively.
Thus, we must compute ∇X^(k) Loss, ∇²X^(k) Loss, ∇Y^(k) Loss, and ∇²Y^(k) Loss based on the following formulae. Eventually, the latent vectors are updated using Eqs. 10 and 11.
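The exact Newton-style update rules are given by the equations referenced above. Purely as an illustration of the optimization objective, the following sketch minimizes the reconstructed loss with plain gradient descent instead of Newton's method; the `factorize` and `laplacian` helpers, the hyperparameter values, and the use of first-order updates are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def laplacian(S):
    # Graph Laplacian L = D - S of a symmetric similarity matrix S
    return np.diag(S.sum(axis=1)) - S

def total_loss(B, SC, SD, X, Y, alpha, beta):
    # Reconstruction error + norm penalties + manifold terms (trace form)
    LC, LD = laplacian(SC), laplacian(SD)
    return (np.linalg.norm(B - X @ Y.T) ** 2
            + alpha * (np.linalg.norm(X) ** 2 + np.linalg.norm(Y) ** 2)
            + beta * (np.trace(X.T @ LC @ X) + np.trace(Y.T @ LD @ Y)))

def factorize(B, SC, SD, k=2, alpha=0.1, beta=0.1, lr=0.01, iters=500, seed=0):
    # Gradient descent on the manifold-regularized factorization loss
    rng = np.random.default_rng(seed)
    m, n = B.shape
    X = rng.standard_normal((m, k)) * 0.1
    Y = rng.standard_normal((n, k)) * 0.1
    LC, LD = laplacian(SC), laplacian(SD)
    for _ in range(iters):
        R = B - X @ Y.T                                  # residual matrix
        gX = -2 * R @ Y + 2 * alpha * X + 2 * beta * LC @ X
        gY = -2 * R.T @ X + 2 * alpha * Y + 2 * beta * LD @ Y
        X -= lr * gX
        Y -= lr * gY
    return X, Y
```

The gradients follow directly from differentiating the three loss terms; the Laplacian products `LC @ X` and `LD @ Y` come from the trace form of the manifold penalties.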
2 The detailed formulae for handling the response matrix with missing values

CDSML can handle the binary response matrix B without imputing missing values. To this aim, the first term of the loss function is simply not computed for missing pairs. Eq. 12 is the resulting loss function, capable of handling a response matrix containing missing values. Then ∇X^(k) Loss^(Missing), ∇²X^(k) Loss^(Missing), ∇Y^(k) Loss^(Missing), and ∇²Y^(k) Loss^(Missing) are computed based on the following formulae. Eventually, the latent vectors are updated using Eqs. 17 and 18.
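Restricting the first loss term to observed pairs can be implemented with a binary mask; the sketch below (an illustrative assumption, not the paper's code, and with the similarity terms omitted for brevity) shows how masking the residual makes the gradients independent of the missing entries:

```python
import numpy as np

def masked_loss_grad(B, M, X, Y, alpha=0.1):
    # M[i, j] = 1 where B[i, j] is observed, 0 where it is missing.
    # Residuals are zeroed at missing positions, so those entries of B
    # contribute nothing to the loss or its gradients.
    R = M * (B - X @ Y.T)
    gX = -2 * R @ Y + 2 * alpha * X
    gY = -2 * R.T @ X + 2 * alpha * Y
    return gX, gY
```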
3 The detailed formulae for single or no similarity scenarios

The CDSML performance can be evaluated in three different scenarios: double similarity, single similarity, and no similarity. In the double similarity case, the formulae described in Section 1 are used. For the single and no similarity scenarios, we use the formulae described in the following subsections.

Single similarity scenario
If only SC is ignored, the loss function changes to Eq. 19. Then ∇X^(k) Loss^(SD), ∇²X^(k) Loss^(SD), ∇Y^(k) Loss^(SD), and ∇²Y^(k) Loss^(SD) are computed based on the following formulae. Eventually, the latent vectors are updated using Eqs. 24 and 25.
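With the SC term dropped from the full loss described in Section 1, Eq. 19 plausibly reads (a reconstruction from the description, not copied from the paper):

```latex
\mathrm{Loss}^{(SD)} = \left\lVert B - XY^{T} \right\rVert_F^{2}
 + \alpha \left( \lVert X \rVert_F^{2} + \lVert Y \rVert_F^{2} \right)
 + \beta \sum_{i,j} SD_{ij}\, \lVert Y(i) - Y(j) \rVert^{2}
```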
If only SD is ignored, the loss function changes to Eq. 26. Then ∇X^(k) Loss^(SC), ∇²X^(k) Loss^(SC), ∇Y^(k) Loss^(SC), and ∇²Y^(k) Loss^(SC) are computed based on the following formulae. Eventually, the latent vectors are updated using Eqs. 31 and 32.

No similarity scenario
If both SC and SD are ignored, the loss function changes to Eq. 33. Then ∇X^(k) Loss^(No sim), ∇²X^(k) Loss^(No sim), ∇Y^(k) Loss^(No sim), and ∇²Y^(k) Loss^(No sim) are computed based on the following formulae. Eventually, the latent vectors are updated using Eqs. 38 and 39.
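With both similarity terms removed, Eq. 33 plausibly reduces to ridge-regularized matrix factorization (again a reconstruction from the description):

```latex
\mathrm{Loss}^{(No\ sim)} = \left\lVert B - XY^{T} \right\rVert_F^{2}
 + \alpha \left( \lVert X \rVert_F^{2} + \lVert Y \rVert_F^{2} \right)
```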
The performance of CDSML on the CCLE dataset using different scenarios

Comparison on missing pairs
We applied the CDSML method on the GDSC dataset with the imputation step and obtained the labels predicted by CDSML for the missing pairs (Pred^(CDSML)). On the other hand, the Pred^(Zhang) labels were obtained according to the procedure explained in the previous section. The following metrics compare Pred^(CDSML) and Pred^(Zhang). The evaluated criteria validate that there is a favorable agreement between the labels predicted by CDSML and by the Zhang et al. method.

Comparison on known pairs
We evaluated both CDSML and the Zhang et al. method using 5-fold cross-validation on known pairs. Then, we converted the values predicted by the Zhang et al. method into sensitive/resistant labels and computed the classification criteria for them.
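The conversion and scoring steps above can be sketched as follows. This is an illustrative assumption, not the paper's code: the helper names are hypothetical, per-drug max concentration thresholds are assumed, and a predicted response below the threshold is assumed to mean "sensitive" (label 1):

```python
import numpy as np

def to_labels(pred, thresholds):
    # pred[i, j]: predicted continuous response of cell line i to drug j
    # thresholds[j]: assumed per-drug max concentration threshold
    # Assumption: response below the threshold => sensitive (1), else resistant (0)
    return (pred < thresholds[None, :]).astype(int)

def classification_metrics(y_true, y_pred):
    # Basic classification criteria from the confusion-matrix counts
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / y_true.size
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

In a cross-validation loop, `to_labels` would be applied to the held-out fold's regression outputs before scoring with `classification_metrics`.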