Machine Learning augmented docking studies of aminothioureas at the SARS-CoV-2—ACE2 interface

The current pandemic outbreak clearly indicated the urgent need for tools allowing fast predictions of bioactivity of a large number of compounds, either available or at least synthesizable. In the computational chemistry toolbox, several such tools are available, with the main ones being docking and structure-activity relationship modeling either by classical linear QSAR or Machine Learning techniques. In this contribution, we focus on the comparison of the results obtained using different docking protocols on the example of the search for bioactivity of compounds containing N-N-C(S)-N scaffold at the S-protein of SARS-CoV-2 virus with ACE2 human receptor interface. Based on over 1800 structures in the training set we have predicted binding properties of the complete set of nearly 600000 structures from the same class using the Machine Learning Random Forest Regressor approach.


Introduction
Subsequent outbreaks of pandemics [1] culminating in the current Covid-19 highlighted the necessity of protective actions from the scientific community. This was manifested in the initial attempts of repurposing currently used drugs, followed by a search for novel antiviral compounds and vaccines. Although the effort put into the studies of agents preventing infection caused by the SARS-CoV-2 virus worldwide is impressive, neither new effective drugs have been discovered nor there is a reassurance that vaccines will catch up with the fast mutations of the virus. This indicates the need for the evaluation of the antiviral activity of synthesizable compounds. In the case of current pandemics the search started with docking approach, pioneered by the most extensive study [2] of the inhibition at the interface between the spike protein (S-protein) of the virus and the human ACE2 receptor, responsible for the viral recognition of host cells (see reference [3] for a recent summary). These studies, while quite exhaustive, were restricted to about 9000 compounds although performed with the aid of one of the fastest available supercomputers. In the chemoinformatics toolbox for studies of ligands interaction with enzymes, the reliability of methods diminishes from molecular dynamics to docking to various variants of Quantitative Structure-Activity Relationship (QSAR). However, the rate of processing ligand structures increases dramatically in the same order. Thus different QSAR methods should allow the exploration of large sets of potential antiviral compounds. The main drawback in applying this approach lies in the fact that it requires large data sets on the activity of closely related compounds to build reliable models. Such data is usually missing, especially when the need for models is urgent. In the lieu of experimental data, the results of docking might be used, although one has to keep in mind that the results of docking do not always correlate with bioactivity.
We have previously [4] attempted building a QSAR model of interactions between the series of compounds containing linear or cyclic N-N-C(S)-N structural motif with the interface between the virus S-protein and the ACE2 receptor but a training set of only 160 compounds did not lead to a reliable QSAR model. In this contribution, therefore, we have extended the number of considered ligands over 10-fold (to 1820) by the inclusion of compounds that can be readily synthesized. We have carried out docking studies using four different docking algorithms and subsequently used these docking scores as the training set and subsequently predicted binding properties of the complete library of 597780 structures from the same class by applying fingerprint-based Machine Learning model employing Random Forests classifiers [5][6][7] which rate outperforms classical QSAR methods by a few orders of magnitudes.

Results
The main focus of the present studies was on the identification of efficient evaluation of bioactivity of compounds containing NH-NH-C(S)-NH motif, which was based on the ability of their binding to the S-protein of SARS-CoV-2 virus-ACE2 human receptor interface which structure was retrieved from the Protein Data Bank (PDB: 6M0J [25]). Considered substituted structures of thiosemicarbazides, thiadiazoles, and triazoles are schematically presented in Fig  4 while all obtained results of docking are collected in Table S1 deposited in the public repository (see Data Availability section). The studied molecules included linear carbonylthiosemicarbazide skeleton and its three cyclic derivatives: 1,3,4-thiadiazole, and 1,2,4-triazole (in the thiol and thionic forms) cores decorated by five different five-member rings as the C-substituent and substituted phenyl ring as the N-substituent. In total 1820 structures including all mono-, di-, and diortho-para-halogen-substituted R 2 substituents have been used.
Four docking scoring functions have been used. These include Vina (Windows implementation in the Chimera environment), FlexX and Hyde (implemented in LeadIT), and ChemPLP (implemented in Gold)-see Materials and Methods section for details. Note that ChemPLP scores, in contrast to the other algorithms employed herein, use mathematical formulas in which the more favorable interactions result in a higher score.
Subsequently, Machine Learning models using Random Forest Regressor have been trained on all four sets of docking results (see Materials and Methods). The correlations obtained during training (R^2) and leave one out validation (Q^2) quantifying models performance are summarized in Table 1 and illustrated in Fig 1. The values of Q^2 above 0.75 are generally an indication of a model with useful predictive capabilities. The best fit was obtained for FlexX, while Vina and ChemPLP docking yielded acceptable correlations. A somewhat worse correlation between the docking scores and molecular fingerprints has been obtained with Hyde.

Discussion
Fingerprint-based Random Forests Regressors model yielded excellent correlation in the case of FlexX results, very good in cases of Vina and ChemPLP, and slightly worse in the case of Hyde. Since no methods of direct visualization of Random Forests exist we have deepened the analysis of the results by performing t-distributed Stochastic Neighbor Embedding (t-SNE) [26] analysis of the descriptor space, as it excels over methods like Principal Component Analysis for highly dimensional data [27]-4096 dimensions in our case. This analysis is illustrated in Fig 2. The 20 clusters appearing in the t-SNE plots were verified to represent the significantly chemically different groups of compounds (all combinations of core moieties and R 1 substituent). The ability of t-SNE to identify the chemically different groups of compounds confirms the choice of fingerprints to describe our compounds. It should be noted that the activity data of the compounds was not used in the t-SNE analysis, it was only added at the stage of plot preparation. Any correlations observed between the activity (presented as color in Fig 2) and the position of the molecule in the t-SNE plots should be interpreted as intrinsic correlations between the activity and chemistry of the molecule. A cluster represents a group of molecules with similar fingerprint patterns, that can be understood as a structural similarity. The appearance of clusters uniform in color (= activity), especially visible in Fig 2A suggests

PLOS ONE
Aminothioureas docking to SARS-CoV-2-ACE2 interface that small variations in structural features do not significantly contribute to the activity. While there is a consensus between docking scores of FlexX, Vina, and ChemPLP, Hyde scores appear to vary within each cluster, a fact that might explain the lower performance of the SAR modeling of Hyde scores. As we have indicated, computational tools that can be used for the prediction of bioactivity of different compounds need nowadays to be very fast to be able to cope with big data collections. Machine Learning techniques like Random Forests outperform significantly other methods such as molecular dynamics, docking, and classical QSAR. While our previous studies [28,29] involving similar techniques hinted at Random Forest performing much better than classical QSAR in the modeling of the docking scores we were unable to sufficiently support such a

PLOS ONE
conclusion due to the limited number of compounds studied. Our present results provide clear evidence that Random Forests calculations trained on docking results can provide an improved scientific tool with better rate and precision of predictions that allow evaluation of properties of hundreds of thousands of compounds in a realistic time. The practice of training fast methods on more precise ones is in fact quite common in computational chemistry. For example, computationally cheaper molecular mechanics force fields can be trained on data from expensive high-level ab initio computations.
However, having evaluated a large library of nearly 600000 compounds comprising the -N-N-C(S)-N-motif, we did not identify any compound that would be a better candidate for the lead compounds for further drug development than those which were in the training set. Therefore, below we discuss the results obtained from docking. Due to the lack of experimental data, and thus our inability to put more trust into particular docking algorithms used, we have ordered all results within a given docking protocol from the best to worst and assigned them a rank corresponding to the position on the list. In this way, the four best compounds have been identified. Subsequently, we have compiled a similar list according to the average rank in all

PLOS ONE
four docking protocols, a "consensus" ranking [30]. These five best compounds are collected in Fig 5. In general, these results indicate that the linear thiosemicarbazides arrangement is preferred, these compounds occupy the first 20 positions on the consensus rank list. This result is not too surprising taking into account the length of the interface rim. Within the best-scored compounds, the majority contain the hydroxyl group in the ortho position of the R 2 substituent. Compounds highly substituted in the phenyl ring did not score high, although triply substituted, with both ortho positions occupied scored highest in the case of ChemPLP and Vina docking. The interactions in the groove connecting S-protein (presented in yellow) with ACE2 receptor (presented in green) are illustrated in Fig 3 on  angles are low indicating that these hydrogen bonds are very weak. In the blind docking to the SARS-CoV-2 S-protein-ACE2 receptor complex, as well as to both these proteins separately docking scores are best at the illustrated position indicating that its mode of action would be stabilization of the complex and thus trapping the virus rather than inhibiting its complexation with the receptor.
All compounds collected in Fig 5 were subjected to ADMET analysis. Major properties pertinent to selecting lead compounds [31] are collected in Table 2. As can be seen, they compare favorably with these of the two drugs tried clinically against Covid-19 (chlorquine and remdesivir).

Docking
Four docking algorithms were used. In the case of the FlexX algorithm [32], as implemented in the LeadIT platform [33], docking space was defined as a sphere with a radius of 7.5 Å centered at the point (83.5, 37.5, 110.0 Å) in the middle of the rim of the interface. Two different strategies were used. The first one was docking corresponding to a rigid receptor. In the second a 100 Å 3 penetration of the van der Waals radii was allowed to account for the protein flexibility. The results of both docking strategies were found to be highly correlated and therefore only the results of docking with "flexible" protein receptor were considered in further studies. All best structures obtained for individual ligands using FlexX were subsequently subjected to docking refinement by a relatively new algorithm Hyde [34] implemented in the same platform. For docking using Vina [35] the standalone Windows-based executable has been used. The binding site was limited to the interface space by defining a 100.0x65.0x80.0 ÅxÅxÅ box centered at the same point as in FlexX docking using a visualization tool implemented in the Chimera [36] program. The same box was defined in studies using SwissDock [37] but it has been neglected by the server and blind docking has been performed instead. Since only a single ligand per submission to the server was possible we have carried out docking for only about 300 ligands and manually selected clusters docked in the space relevant to the interface. Finally, the ChemPLP algorithm [38] as implemented in the Gold program [39] was used with the same docking space as in the case of FlexX calculations. This algorithm has been considered as one of the best in most recent benchmark studies [40]. Blind docking in the case of all algorithms was used to check if the binding at the interface is the optimal place for a given ligand. Furthermore, binding to individual proteins (ACE2 receptor and S-protein) has been carried out to investigate the role of ligands (as a binder of binding inhibitor). For the proteins and ligands preparation and visualization, apart from those embedded in the docking programs, Hyperchem [41], Gaussview [42], Chimera [19], and Mercury [43] were used.

Machine Learning using Random Forest Regressor
All of the 1820 structures were converted from 2D to 3D using RdKit [44] and relaxed at the molecular mechanics level using MMFF94 Merck Force Field [45]. Morgan Fingerprints [27] with a radius of 3 and bit length of 4096 were used as a representation of general structural features of compounds. Calculations were done using Python scripts in the Anaconda environment. Models were built using scikit-learn [46] implementation of Random Forests Regressor with grid optimization of the most important hyperparameters listed in Table 3. The R^2 on  the whole training set and Q^2 by five-fold cross-validation were used as metrics for learning and prediction performance, respectively. Each of the considered docking scores modeled delivered a separate hyperparameter set. The final models were validated by leave one out cross-validation procedure.
Once mathematical models were created they were applied to the collection of 597780 structures comprising all variants of R 2 , i.e., all combinations of the phenyl ring decorated with

PLOS ONE
one to five substituents listed in the last column of Fig 4 after removal of structures present in the learning set. This collection was build using our own Python scripts in the Anaconda environment [47].
ADMET SwissADME program [48] implemented online [49] has been used for the assessment of ADME properties and online implementation [50] of the PreADMET program [51] was used for basic toxicology properties of all 1820 compounds in the training set. For the best results collected in Fig 5 Lipiński's rules, solubility, and gastrointestinal absorption has been taken from the SwissADME. The toxicity of these compounds to humans (the last five entries reported in Table 2) has been obtained using the online [52] ADMETlab platform [53].