A machine learning model trained on a high-throughput antibacterial screen increases the hit rate of drug discovery

Screening for novel antibacterial compounds in small molecule libraries has a low success rate. We applied machine learning (ML)-based virtual screening for antibacterial activity and evaluated its predictive power by experimental validation. We first binarized 29,537 compounds according to their growth inhibitory activity (hit rate 0.87%) against the antibiotic-resistant bacterium Burkholderia cenocepacia and described their molecular features with a directed message-passing neural network (D-MPNN). Then, we used the data to train an ML model that achieved a receiver operating characteristic (ROC) score of 0.823 on the test set. Finally, we predicted antibacterial activity in virtual libraries corresponding to 1,614 compounds from the Food and Drug Administration (FDA)-approved list and 224,205 natural products. Hit rates of 26% and 12%, respectively, were obtained when we tested the top-ranked predicted compounds for growth inhibitory activity against B. cenocepacia, which represents at least a 14-fold increase from the previous hit rate. In addition, more than 51% of the predicted antibacterial natural compounds inhibited ESKAPE pathogens, showing that predictions extend beyond the organism-specific dataset to a broad range of bacteria. Overall, the developed ML approach can be used for compound prioritization before screening, increasing the typical hit rate of drug discovery.

We thank the reviewer for this great suggestion. We have now added the Matthews correlation coefficient (MCC) to our binary classification results in addition to ROC-AUC, PRC-AUC, and F1 score. We have added the MCC analysis to Tables S1 and S2 and modified the related content in the Results section accordingly (see lines 118-119 on page 6). The definition of MCC was also added to the section "Evaluation Methods" (lines 384-395 on page 16).
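For readers unfamiliar with the metric, the standard definition of the MCC can be sketched as follows. This is a minimal illustration of the textbook formula, not the authors' evaluation code; the function name `mcc` and the confusion-matrix counts in the usage line are hypothetical and are not taken from Tables S1 or S2.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    Returns 0.0 when the denominator is zero (a common convention).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for a heavily imbalanced screen (illustrative only).
example = mcc(tp=20, tn=2900, fp=30, fn=10)
print(f"MCC = {example:.3f}")
```

Because the MCC uses all four cells of the confusion matrix, it tends to be more informative than accuracy when actives are rare, as in a screen with a hit rate below 1%.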
2. The source code is not provided in the manuscript; it would be very helpful if it could be made available to support open science.
We thank the reviewer for the above suggestion. The source code of the machine learning model (D-MPNN)
Reviewer #3: In this manuscript, the authors trained an ML model with data from a high-throughput screening experiment and used this ML model to predict compounds with antibacterial activity in the library of FDA-approved compounds and natural products. Then, some compounds with growth inhibitory activity against several Gram-negative bacteria were identified by wet-lab experiments. This manuscript combines experiments and computation well and has the potential to be accepted. However, the reviewer has some concerns and suggestions.
Major Comments: (1) The authors claimed many times in the article that their approach increases the hit rate of drug discovery by at least 12-fold. However, the hit rate of a virtual screen depends strongly on the experimental system and the definition of hits. It is not appropriate to quantitatively compare the hit rate in this study with the hit rate from conventional whole-cell-based high-throughput screens. I recommend that the authors compare the hit rate in this study with the hit rate from their previously performed HTS (the training dataset).
We appreciate and agree with this thoughtful and important comment by the reviewer. The hit rate of the previously performed HTS was 0.67% (Selin et al., 2015). This hit rate changed slightly, to 0.87%, after binarization by the B-score. The new hit rates obtained after prediction by the model and testing of the top predicted compounds were 26% and 12% for the FDA-approved library and the natural product library, respectively. Therefore, relative to the training set, we calculate an increase in the hit rate of at least 14-fold. We have modified this number and added an explanatory paragraph in the discussion (see page 11, lines 254-255 and 261-264).
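The arithmetic behind the fold increase can be checked with a short sketch. The hit rates are the ones quoted above; the function name `fold_enrichment` and the dictionary layout are illustrative choices, not part of the manuscript.

```python
def fold_enrichment(hit_rate_pct: float, baseline_pct: float) -> float:
    """Ratio of an observed hit rate to a baseline hit rate (both in percent)."""
    return hit_rate_pct / baseline_pct


# Baseline: hit rate of the training HTS after B-score binarization.
BASELINE_PCT = 0.87

# Hit rates among the top-ranked predicted compounds, as quoted in the response.
predicted_hit_rates = {
    "FDA-approved library": 26.0,
    "Natural product library": 12.0,
}

for library, rate in predicted_hit_rates.items():
    fold = fold_enrichment(rate, BASELINE_PCT)
    print(f"{library}: {fold:.1f}-fold over baseline")
```

The lower of the two ratios, for the natural product library, rounds to 14, which is where the "at least 14-fold" figure comes from; the FDA-approved library gives roughly 30-fold.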
(2) Most of the compounds screened by the ML model from the FDA-approved compound library are known antibiotics. Are there very similar molecules to those hit compounds in the training set? Please provide the maximum similarity between these hit compounds and the compounds in the training set.
(Fig S7A).
One is an antifungal called clioquinol. The other compound, cetylpyridinium, is an antimicrobial used in oral health products to promote gingival health. The screen also identified several compounds in the subset of the FDA library that were similar in structure to compounds in the training set (Fig S7B); however, none of these compounds was identified as active.
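The maximum similarity the reviewer asks for can be illustrated with a small sketch. It assumes Tanimoto similarity on binary fingerprints, which is a common choice but is not confirmed by this response as the metric used for Fig S7; the fingerprints below are hypothetical sets of on-bit indices, not real compound data.

```python
from typing import FrozenSet, Iterable

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity: shared on-bits / all on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(query: Fingerprint, training: Iterable[Fingerprint]) -> float:
    """Highest similarity between one query compound and any training compound."""
    return max(tanimoto(query, fp) for fp in training)


# Hypothetical fingerprints, for illustration only.
training_set = [frozenset({1, 2, 3, 4}), frozenset({7, 8, 9})]
hit_compound = frozenset({1, 2, 3})

print(f"max similarity to training set: {max_similarity(hit_compound, training_set):.2f}")
```

A hit whose maximum similarity to the training set is low would suggest the model is not simply recalling near-duplicates of training actives.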
Minor Comments: (1) The compound structures in Fig 3 and Fig 6 are not clear enough.
We appreciate the reviewer's feedback. Figures 3 and 6 have now been modified to make the structures of the compounds clearer.
(2) Is the training dataset in this study available?