Bioactivity assessment of natural compounds using machine learning models trained on target similarity between drugs

doi:10.1371/journal.pcbi.1010029

Fig 1.

(A) Overview of the workflow. A combined training and validation set of drug pairs was created using several predictor variables (fingerprints, MCS and physicochemical properties). The model was trained on the response variable (Match or Nomatch) and evaluated on an independent test set. The natural compound library, paired with drugs, was then virtually screened to obtain hit pairs, followed by analysis and in-vitro validation. (B-C) Similarity metrics (ML dataset). (B) Molecular fingerprints: each of the 7 fingerprints generates a different similarity score for the drug pairs compared. The box plot at the center shows the median value of each, and the spread shows the density of drug pairs around that score. (C) MCS: the MCS algorithm reports two types of scores, the Tanimoto score and the overlap coefficient (OC). The violin plots were smoothed for density by an adjustment factor of 3. (D-F) Performance on the test set. (D) Performance of the four models, viz., regularized logistic regression (L1R and L2R), naïve Bayes (NB) and random forest (RF), on the independent test set for all 5 split-sets. Performance was evaluated using balanced measures: F1 score, Matthews correlation coefficient (MCC), positive predictive value (PPV) and area under the curve (AUC). RF outperformed the logistic regression and naïve Bayes models on all metrics and data splits. The performance of all models was also evaluated using (E) precision-recall and (F) ROC curves. The RF models achieved an AUC of 0.90 averaged over all 5 test-split sets, whereas NB and the LR models performed relatively poorly on all split-sets (average: NB: 0.68, L1R: 0.51 and L2R: 0.50). (G) High-ranking features of the RF models on the 5 split-sets. The top features are displayed, showing that most of the distance-based features provided maximum information gain, with ‘Featmorgan’ performing best.
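
The two MCS scores named in panel (C) can be illustrated with a minimal sketch. It assumes the usual size-based definitions, where a and b are the atom counts of the two molecules and c is the atom count of their maximum common substructure: Tanimoto = c / (a + b - c) and overlap coefficient = c / min(a, b). The function names and the example sizes are hypothetical and are not taken from the paper, whose MCS implementation may differ in detail.

```python
def mcs_tanimoto(size_a: int, size_b: int, size_mcs: int) -> float:
    """Tanimoto score from atom counts: c / (a + b - c).

    Assumed definition; size_mcs is the atom count of the maximum
    common substructure shared by the two molecules.
    """
    _check_sizes(size_a, size_b, size_mcs)
    return size_mcs / (size_a + size_b - size_mcs)


def mcs_overlap(size_a: int, size_b: int, size_mcs: int) -> float:
    """Overlap coefficient from atom counts: c / min(a, b)."""
    _check_sizes(size_a, size_b, size_mcs)
    return size_mcs / min(size_a, size_b)


def _check_sizes(size_a: int, size_b: int, size_mcs: int) -> None:
    """Reject inputs that cannot describe a real common substructure."""
    if size_a <= 0 or size_b <= 0:
        raise ValueError("molecule sizes must be positive")
    if not 0 <= size_mcs <= min(size_a, size_b):
        raise ValueError("MCS cannot be larger than the smaller molecule")


if __name__ == "__main__":
    # Hypothetical pair: a 10-atom compound and a 20-atom drug sharing 8 atoms.
    print(round(mcs_tanimoto(10, 20, 8), 3))
    print(round(mcs_overlap(10, 20, 8), 3))
```

Under these definitions the overlap coefficient is never lower than the Tanimoto score, because it normalizes by the smaller molecule only. This is why a small compound that is largely contained in a larger drug can score high on OC while scoring low on Tanimoto.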

Fig 2.

Drug-food compound similarity.

(A) Number of hits retrieved from each split-set model. (B) 200 drug-food pairs predicted as ‘match’ at a probability threshold of >0.5. The drugs are arranged according to their therapeutic class and the food compounds according to their food source. The highlighted colored links represent the case examples in the five author-defined groups (details in the text). (C) Group 4 (probable lead): example taken up for experimental validation. The food compound 5-methoxysalicylic acid was a hit with the drug triflusal, which has 4 known targets. We validated the inhibitory activity of triflusal and 5-methoxysalicylic acid against the target PTGS1 (also known as Cox-1).

Fig 3.

Cox-1 inhibitor assay.

(A) Chemical structures of all the tested compounds. MCS structures are also depicted, which helped to assess intuitively the structural similarity between the tested compounds. (B) An example relative fluorescence units (RFU) plot of the tested compounds at 100 μM (other tested concentrations: 12.5 μM to 400 μM serial dilutions). SC560 is a positive control provided by the assay kit supplier (Materials and methods). (C) Relative inhibition of the positive control (drug triflusal), test compound (5-methoxysalicylic acid) and negative control (4-isopropylbenzoic acid) at the different tested concentrations. 5-methoxysalicylic acid showed inhibition of Cox-1 similar to that of the drug triflusal, whereas no such inhibition was observed for 4-isopropylbenzoic acid. 4-isopropylbenzoic acid produced a strong color change (bright pink) above 100 μM and was therefore unsuitable for testing at higher concentrations with this assay.

Fig 4.

RF vs featmorgan.

(A) Number of hits retrieved using the Tanimoto score with featmorgan as the similarity measure, which grows markedly as the threshold is reduced (a lower threshold means less similarity). (B) Correlation between the RF models’ average probability predictions >0.5 and the corresponding featmorgan Tanimoto scores of the drug-food pairs. Our hit pair, triflusal and 5-methoxysalicylic acid (highlighted in red), was predicted as a hit by the RF models (ranked 219th) but would be missed by featmorgan if used alone. (C) Rank comparison between the hit pair (Triflusal:5-methoxysalicylic acid) and the negative control (Triflusal:4-isopropylbenzoic acid). The negative control was not a hit with the RF models, although featmorgan ranked it higher than the hit pair; conversely, the hit pair was an RF hit despite its lower featmorgan rank.
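
The comparison in panels (A) and (B) can be sketched as two simple filters: counting pairs at or above a Tanimoto threshold, and listing pairs that the RF models call a hit (probability >0.5) but that fall below a chosen Tanimoto cutoff. This is an illustration only. The pair names, scores and the 0.5 Tanimoto cutoff below are invented toy values, and the function names are hypothetical; none of them come from the paper.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (drug, food compound)


def hits_at_threshold(scores: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Pairs whose similarity score is at or above the threshold."""
    return [pair for pair, score in scores.items() if score >= threshold]


def rf_only_hits(
    rf_prob: Dict[Pair, float],
    tanimoto: Dict[Pair, float],
    prob_cutoff: float = 0.5,      # RF 'match' threshold stated in the captions
    tanimoto_cutoff: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> List[Pair]:
    """Pairs the RF model calls a hit that a Tanimoto-only filter would miss."""
    return [
        pair
        for pair, prob in rf_prob.items()
        if prob > prob_cutoff and tanimoto.get(pair, 0.0) < tanimoto_cutoff
    ]


# Invented toy data, for illustration only.
toy_tanimoto = {
    ("drug_A", "food_1"): 0.9,
    ("drug_A", "food_2"): 0.6,
    ("drug_B", "food_3"): 0.4,
    ("drug_C", "food_4"): 0.2,
}
toy_rf_prob = {
    ("drug_A", "food_1"): 0.8,
    ("drug_A", "food_2"): 0.3,
    ("drug_B", "food_3"): 0.7,
    ("drug_C", "food_4"): 0.1,
}

if __name__ == "__main__":
    # The hit count grows as the similarity threshold is lowered.
    for t in (0.8, 0.5, 0.3):
        print(t, len(hits_at_threshold(toy_tanimoto, t)))
    print(rf_only_hits(toy_rf_prob, toy_tanimoto))
```

In the toy data, the pair with an RF probability of 0.7 and a Tanimoto score of 0.4 plays the role described for the triflusal pair in panel (B): a hit for the model that a single-fingerprint threshold would discard.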
