Virtual Screening Models for Prediction of HIV-1 RT Associated RNase H Inhibition

The increasing resistance to current therapeutic agents for HIV drug regiment remains a major problem for effective acquired immune deficiency syndrome (AIDS) therapy. Many potential inhibitors have today been developed which inhibits key cellular pathways in the HIV cycle. Inhibition of HIV-1 reverse transcriptase associated ribonuclease H (RNase H) function provides a novel target for anti-HIV chemotherapy. Here we report on the applicability of conceptually different in silico approaches as virtual screening (VS) tools in order to efficiently identify RNase H inhibitors from large chemical databases. The methods used here include machine-learning algorithms (e.g. support vector machine, random forest and kappa nearest neighbor), shape similarity (rapid overlay of chemical structures), pharmacophore, molecular interaction fields-based fingerprints for ligands and protein (FLAP) and flexible ligand docking methods. The results show that receptor-based flexible docking experiments provides good enrichment (80–90%) compared to ligand-based approaches such as FLAP (74%), shape similarity (75%) and random forest (72%). Thus, this study suggests that flexible docking experiments is the model of choice in terms of best retrieval of active from inactive compounds and efficiency and efficacy schemes. Moreover, shape similarity, machine learning and FLAP models could also be used for further validation or filtration in virtual screening processes. The best models could potentially be use for identifying structurally diverse and selective RNase H inhibitors from large chemical databases. In addition, pharmacophore models suggest that the inter-distance between hydrogen bond acceptors play a key role in inhibition of the RNase H domain through metal chelation.


Introduction
According to a recent report from the UNSAIDS, it is estimated that more than 34 million people are living with a HIV-1 type infection worldwide and 2.5 million new HIV infections occur every year. Currently, 14.8 million people are eligible for HIV treatment, however only 8 million people are under treatment due to various reasons which includes economical issues [1]. Although AIDS related mortality has been reduced by 24% (1.7 million in 2011) compared to 2005 data (2.3 million), development of improved anti-HIV regiments is still required. To control HIV progression, several viable chemo-targets have been identified [2,3] in the HIV replication cycle, such as fusion or entry of HIV with the host CD4 receptor, reverse transcription of viral RNA into viral DNA (by reverse transcriptase), integration of viral DNA with host DNA (by integrase), and maturation of new viral protein (by protease). Although significant improvements have been made in HIV therapy within the last decades, there is still a strong demand for improving AIDS therapy due to an increasing drug resistance [4,5]. Reverse transcription of genomic single strand RNA into double strand DNA (dsDNA) by viral reverse transcriptase (RT) is a key process in replication of HIV and dsDNA is subsequently integrated into the genome of the host cell. RT has two catalytic domains in order to carry out the reverse transcription process. Very briefly (1) the DNA polymerase domain uses cellular RNA primer (specifically tRNA lys3 ) to synthesize single strand viral (2) DNA (RNA dependent DNA synthesis), subsequently, the synthesized HIV (2) DNA is hybridized with a viral RNA template to form a RNA:DNA hybrid duplex, (2) RNase H domain removes the RNA strand from the hybrid and facilitate the first strand transfer which leads to formation of purine rich sequence of HIV RNA (also called ''polypurine tract'' (PPT). Here, PPT serves as a primer for the synthesis of viral (+) DNA strand and subsequently the RNase H removes the PPT portion after priming of (+) DNA synthesis. The majority of the currently marked antiviral drugs that have been approved by FDA (The Food and Drug Administration) for the treatment of HIV infection are RT inhibitors which particularly inhibits at the polymerase domain [6]. Due to the high rate of viral mutation and resistance to current drug regiments [7], considerable attention has in recent years been paid to less explored target sites within the HIV replication process [8][9][10], one such target is the inhibition of RT associated RNase activity [11].
RNase H is one of the two domains of the p66 (66 kDa) subunit of reverse transcriptase. From mutation and x-ray crystallographic studies [12][13][14][15] the structure of the RNase H domain has been well characterized. It is composed of five standard mixed sheets, which are surrounded by four helix, and eight loops in the center of the domain. The active site of RNase H consists of four highly conserved amino acids Asp-443 (D443), Glu-478 (E478), Asp-498 (D498) and Asp-549 (D549) ( Figure 1A/B) and two catalytically active magnesium ions (Mg 2+ ). Moreover, it has been shown that mutation of any of these residues abolishes the enzyme activity. This is because these residues provide a favorable environment for stabilizing metals (Mg 2+ ), which is essential for a proper binding and positioning of the RNA:DNA duplex during the digestion process [16]. Over the past years a large number of RNase H inhibitors have been designed, synthesized and some entered into the clinical trial, but none of these have yet reached the market. Furthermore, these inhibitors all show a low level of specificity towards RNase H, meaning that many of these compounds also inhibit polymerase [17][18][19]. Two mechanisms of RNase H inhibition have been proposed; (i) compounds that blocks metal binding at the active site, (also called ''active site directed RNase Inhibition'') and (ii) compounds that binds at the allosteric site of RNase H [20]. Both classes of compounds show high level of ligand diversity in general, however, the active site directed RNase H inhibitors are unique in the sense that the majority of inhibitors are rich on hydrogen bond acceptor sites, which is believed to form chelation with the magnesium ions [14,15,[21][22][23]. Representative examples of molecules belonging to this class of site directed RNase H inhibition is shown in Figure 1C.
With an increasing number of high resolution 3D structures of HIV-1 RT associated RNase H domain and deposit of known inhibitor data in public databases, ligand and structure-based modeling tools may readily be used for designing selective inhibitors. Because of the rapid advancement in high throughput screening (HTS) approaches, the availability of screened compounds is increasing exponentially; for example, Parniak et al.
screened nearly 100000 NCI chemical compounds for the HIV-1 RNase H inhibitor and deposited these at the PubChem database (AID: 372) [24]. This enormous information can subsequently be analyzed and trained for new lead identification through cheminformatics and bioinformatics approaches, e.g. a fingerprint based-decision tree method has been applied to classify inhibitors of the RNase H with success rate of approximately 99% for 10-fold cross validation (sensitivity = 57%, specificity = 99%) [25]. Very recently, ligand based virtual screening was successfully performed using shape-based similarity searching methods [26] for identification of duel inhibitors of HIV-1 RT. Subsequently the models were applied to screen the NCI chemical database and finally found 4 out of 34 molecules to shown inhibition at micro molar concentrations (2-22 mM). Thus computer-aided approaches such as molecular docking, 3D-QSAR, pharmacophore modeling, machine learning methods and shape-similarity based virtual screening has emerged as effective methods for identification novel compounds [27][28][29]. It has previously been observed from validation of virtual screening experiments that docking methods performed best in identifying high affinity CYP1A2 inhibitors compared to machine learning based screenings (correctly predicted 21 out of 41 tested). However, the models developed with random forest were found to be very efficient since it required only 39 two-dimensional physicochemical descriptors (or properties) for new prediction [27].
The aim of the present study is to develop virtual screening models for HIV-1 RT associated RNase H inhibition through conceptually different computational methodologies, for instance, using ligand and structure information. To this end, five screening methods have been evaluated and it is suggested which of these methods is best suite for efficient identification. The methods applied comprises (1) pharmacophore modeling, (2) ligand flexible docking, (3) grid based pharmacophoric -fingerprint for ligand and proteins (FLAP), (4) machine learning algorithms, (such as random forest, support vector machine, kappa nearest neighbor) (5) rapid overlay of chemical structures (ROCS). This is, to our knowledge, the first time that such conceptually different methodologies have been assessed with respect to efficacy and efficiency in virtual screening setup for identification of RNase H inhibitor of HIV-1 reverse transcriptase.

Ligand preparation
A set of 135 HIV-1 RT associated RNase H inhibitors was collected from the literature [14,15,22,[29][30][31][32][33][34][35][36][37][38][39][40]. This dataset consists of different scaffolds including highly active pharmacophores as represented in Figure 1C. In addition to these literature compounds, structure and biological activities (reported in binary format, i.e active as 1 and inactive as 0) of 99766 compounds were collected from the PubChem Bioassay database (AID: 372) [24]. In brief, the inhibition of HIV-1 RT associated RNase H catalyzing the RNA:DNA duplex (substrate) to fluorescein labeled-RNA strand was measured. Fluorescein tagged substrates show very low background fluorescence and upon hydrolysis the fluorescence increases up to 50 folds. Compound inhibition is thereby proportional to the fluorescence intensity [41]. Of the 99766 compounds retrieved from PubChem, 1863 compounds were removed, due to lack of either activity data or structures and in come cases the structures existed as duplicate entries. The remaining structures were imported into the Standardizer tool [42] (version 5.12.2) in order to refine the structures by normalization of chemo-types (e.g. tautomerization, aromatization and remove steric clashes). The remaining 97903 compounds, which includes 760 active and 97143 inactive, were processed for further steps. The overall workflow of the present study is shown in Figure 2.

3D conformation generation
Both set of compounds (literature and PubChem) were converted into 3D structures using OMEGA [43] (a conformation generating tool). As default OMEGA reports multi-conformer for all compounds using the MMFF95S force field, however, in the present study only a single conformer i.e. the one possessing the lowest energy was used.

Subset Selection
Bioactivities and structures were imported into KNIME (version 2.6.3) [44], an open source data integration platform, which provides a variety of data mining/chemistry applications and data analysis nodes. In this study we used KNIME for subset selection purposes as well as for machine learning model developments. In order to reduce the computational time, we selected a small subset of inactive (n = 971 compounds, ''Subset_inactive'') of the PubChem dataset, which represents the whole inactive in the dataset. The selection was made randomly as implemented in KNIME and activity threshold for the inactive compound was 0-50 (% of inhibition). Of the 760 active compounds in the PubChem dataset, 20 highly active and structurally diverse compounds were randomly chosen using KNIME (activity threshold .80% of inhibition) and stored for enrichment study as known binder (n = 20, ''Subset_actives''). Structures and bioactivities of subsets used in this study are provided in the supplementary material.

Template selection
Canvas (version 1.5) chem.-informatics tool [45] was used to perform a hierarchical clustering based on the dendritic fingerprints and Tanimoto similarity (TS) as method of metric (TS AB = c/a+b2c, a = bits set to 1 in molecule A, b = bits set to 1 in molecule B, c = number of 1 bits common to molecule A and B). From the 135 compounds in the literature dataset, 22 individual clusters were obtained with merging distance cutoff 0.80. This was reduced to 15 clusters (merging distance cutoff 0.93) and two highly active compounds (in terms of pIC50) from each cluster were selected as template (n = 30, ''subset_active_lit'') for pharmacophore search and enrichment study. A representative compound for each cluster as well as the full list of compounds used as known actives for validation is provided in the supplementary material [ Figures S1, S2]. Special care was taken to verify that the x-ray bound ligands were also part of the subset selected as template.

Machine learning methods and Feature selection methods
Various commonly used machine learning (ML) techniques such as support vector machine (SVM), kappa nearest neighbor (kNN) and random forest (RF) were applied. These methods also represent common approaches for classification. A detailed account of the theory behind these methods can be found elsewhere [46]. All the classification models were constructed using Weka [47], which provides a set of classification and regression methods, and attribute selection methods. In the present study, various automatic feature selection procedures were applied such as CfsSubsetEval (correlation-based feature subset selection evaluator) with BestFirst and Genetic search algorithms. All machine learning based classification and attribute selections used in this study were performed with Weka nodes as implemented in KNIME.

Molecular descriptors and fingerprints
Ligand information for classification was obtained from molecule descriptors and fingerprints. A set of 494 two-dimensional descriptors and 269 fingerprints (including MACCS structural keys and Sub-structure fingerprints) were calculated using PaDEL [48], a descriptor computing software as implemented in KNIME. Descriptors used in this study includes physical properties, atom and bond counts, adjacency and distance matrix descriptors containing BCUT and GCUT descriptors, Kier & Hall connectivity, kappa shape indices, subdivided surface areas, and pharmacophoric features. Fingerprints represents small ''substructures'' consisting of one to ten non-hydrogen atoms and enumerates very simple features, but when used in combination they can prove to be very specific and useful in distinguishing the characteristics among the small molecules. Low variance descriptors (variance upper bound set to 0.0), and insignificant fingerprints (bits) were removed before the model optimization.

Pharmacophore modeling
Pharmacophore modeling was carried out using the Phase (version 3.1) [49] module of the Schrödinger molecular modeling suits [50]. Phase uses LigPrep for the structure cleaning process, which includes generating possible tautomers, and ionization states (at pH 7), followed by generating conformers using the ''Conf-Gen'' macro-model search algorithm with thorough sampling. The OPLS2005 force field was chosen with a distance-dependent dielectric solvent model. Phase develops the pharmacophore hypothesis in three steps; (1) creation of pharmacophoric sites from active compounds using six common features: hydrogen bond donor (D), hydrogen bond acceptor (A), hydrophobic region (H), negatively charged region (N), positively charged region (P) and aromatic rings (R), (2) perceiving common pharmacophore, group similar pharmacophores according the inter-site distance (the distance between pairs of sites in the pharmacophore) using tree-based partitioning methods, and (3) finally, the surviving hypothesis from these partitioning procedures will be scored and ranked. The quality of the pharmacophore alignment was assessed using the RMSD (distance tolerance set to 1.2 Å ) and the quality of the hypothesis assessed using survival score, sites, vector, and selectivity scores [see supplementary material for score description]. In the current setting, at least 4 common pharmacophore points must match with the selected active compounds and other settings were chosen as defaults.

Structural similarity search
The rapid overlay of chemical structures (ROCS) software [51] from OpenEye was used for shape-based structural similarity search [52]. Shape similarity can be determined, in part by comparing the shape of a query molecule with a reference molecule. This program is particularly designed to screen a large database using a solid-body optimization process (uses only heavy atoms for superposition) that maximizes the molecular overlap (volume and atomic features). ROCS uses the ImplicitMillsDean force field that defines six pharmacophoric color fields, such as hydrogen bond donor, hydrogen bond acceptor, hydrophobic region, anions, cations, and rings. Once the overlap is optimized, the shape similarity is computed using TanimotoCombo (ranges 0-2, high score indicates the higher shape and color similarity). TanimotoCombo provides the scores that include both shape similarity fit and color similarity fit (chemical pattern). The ROC program reads SMILES, 2D, or 3D structures as input and generates possible low energy conformations using the OMEGA 3D structure generating module as implemented in ROC.

Fingerprint for Ligands and Proteins (FLAP)
The software FLAP [53] was used to build and validate ligand based virtual screening models. FLAP uses fingerprints derived from GRID molecular interaction fields (MIFs) and GRID atom types are characterized as quadruplets of pharmacophoric features. The GRID approach is a well assessed concept for determining energetically favorable interaction sites in molecules with known structure using chemical probes e.g., H, O, N1, and DRY probes which describes the shape, hydrogen bond acceptor, hydrogen bond donor and hydrophobic interactions, respectively. FLAP creates a common reference framework in two steps: first, the MIFs of the molecules are calculated using the GRID force fields, and the resulting MIF's are condensed in complexity by extracting points (quadruplets or hotspot) representing the most favorable interactions. Second, each quadruplet of these points is used to generate different superpositions of the test molecules onto a template molecule. The quadruplets of each molecule are stored as a pharmacophoric fingerprints and used to evaluate their similarity. Superposition of quadruplets is assessed through Probe scores and Distance score, which represents the degree of overlap of the MIFs for each probe individually as well as for their combinations and overall difference of probe score between the ligand and template, respectively. In addition, FLAP calculates Global Sum score (Glob_Sum) and Global Product scores (Glob_Prod). The former score is produced by summing all the scores of the individual probes together and the later score is produced by multiplying all the scores of the individual probes together. All similarity measures ranges from 0.0 (bad) to 1.0 (good), except the Distance, where 0 is good. GRID probes H, N1, DRY and O were used for the FLAP modeling. The rational choice for these four specific probes is justified by the fact that they represent distinct ligand-protein interaction features. The distance (i.e. spatial resolution) between two GRID points was set to 0.75 Å .

Automated docking and scoring: preparation of protein and ligands
A computational model of the HIV-1 RT associated RNase H domain was constructed from an X-ray crystal structure with resolution of 1.4 Å from the Protein Data Bank (PDB ID: 3QIO) [15]. In the deposited crystal structure of RNase H domain with N-hydroxy quinazolinediones (bound active site inhibitor), 12 residues were missing and the structure was determined with manganese (Mn 2+ ) ions instead of the two catalytically active magnesium ions. Atomic coordinates for the missing residues were generated using the Swiss-Model [54]. Subsequently, the protein model was imported into the Maestro module available in the Schrödinger package and the protein was further optimized using the Protein Preparation Wizard [55]. This optimization includes adding hydrogen atoms, assigning correct bond orders and building di-sulfide bonds and replacing the Mn 2+ ions with Mg 2+ ions. The protonation states of all of the ionizable residues were predicted by PROPKA [56] provided in the Protein Preparation Wizard in the presence of the Mg 2+ ions at the active site. An optimized structure model was energy minimized (only hydrogen atoms) using the OPLS2005 force field.
The receptor grid generation module of Glide [57] was used to define the active site for the docking experiments. As this protein model has a bound ligand (3-hydroxy-6-(phenylsulfonyl)quinazoline-2,4(1H,3H)-dione), the ligand was set as the centroid of the grid box (size of the active site is 20 Å from ligand position). Water molecules in the active site beyond 3 Å from the bound ligand were deleted.

Ligand flexible docking protocol
Glide (version 5.8), a grid-based exhaustive search algorithm was used for all docking experiments [57]. Glide uses a series of hierarchical filters to find possible ligand pose in the active site, and the program has the option to treat the ligand fully flexible or rigid during the docking run. In addition, glide provides three docking precision modes, namely, XP (extra precision), SP (Standard precision) and HTS (High-throughput screening) modes. Each mode are used in slightly different context, e.g., the HTS mode is used to screen a relatively large database (uses more restricted conformational sampling), the SP mode uses a softer scoring function that adapt at identifying ligands that have a reasonable propensity to bind in the receptor, and the XP mode uses a complete minimization, and scoring (and additional terms used over SP, e.g. solvation) from large ensembles of docking poses (requires more CPU time), thus this mode is specially used for topranked compounds. Glide uses an in-build docking scoring function resulting in a Glidescore (SP and XP). In the current setting, all three docking modes were analyzed for the virtual screening. Together with the different ligand preparation settings (original and Epik process [58] from LigPrep) this resulted in a total of 6 docking runs (Table 1) for all sets of ligands in the dataset. Epik is an application that generates possible protonation states, tautomers and metal binding sites in the ligand.

Validation of virtual screening models
The computational models of each experiment were validated according to their assessment parameters, e.g., the pharmacophore goodness was evaluated according to its survival score. The quality of the machine learning models was assessed using the terms, Sensitivity (the proposition actives being predicted as actives), Specificity (proposition of inactive being predicted as inactive), Gmean (a measure of balanced prediction of each of the two classes) and Matthews's correlation coefficient (MCC) (The quality of the overall binary classification model and models are considered to be predictive if MCC is higher than 0.4.). In addition, 10-fold cross validation was also carried out as implemented in Weka from KNIME. The virtual screening effectiveness is evaluated in terms of the enrichment factor (EF) as a function of the percentage6that is covered of the complete ranked database, meaning that the proportion of active molecules found from a test database in which a small number of actives have been hidden in a large inactive compounds. Thus, where f Active is the percentage of the actives found after assessing x% of the ranked database. Successful screening implies EF..1.
In addition to the EF, the enrichment is also assessed with a receiver operating characteristic (ROC) value and area under the curve (AUC). This denotes how good the model is in discriminating active compounds from inactive. ROC/AUC values range from 0.0 (poor) to 1.0 (ideal) and 0.5 represents a random prediction.

Results and Discussion
As mentioned in the introduction the aim of the current study is to develop and validate virtual screening models that can efficiently be used to find highly selective and potent HIV-RT RNase H inhibitors. To achieve this goal, we used conceptually different in silico methods as detailed above. Below, we will present and discuss our virtual screening experiments for efficient identification of inhibitors for the targeted protein.

Characterization of the dataset
The drug-likeness of the dataset was assessed by calculating the Lipinski's ''rule-of-five'' descriptors and count how many of these that fail each rule. The result shows that .97 of the compounds in the dataset did not violate any of the four properties (e.g., hydrogen bond donor, hydrogen bond acceptor, logP, molecular weight) and only 3% of the molecules in the dataset violates one or two rules. As mentioned in the method section developing models with large number of compounds (e.g., ,97000) significantly reduce the computational efficiency and highly imbalanced class distribution, which leads to over prediction of major class, e.g., active (minor class) predicted as inactive (major class). Thus we decided to select a subset of the compounds which represents the rest of the active and inactive compounds in the whole dataset. A subset consisting of 971 inactive compounds was selected using random sampling from KNIME, and projected to the rest of the inactive compounds in the dataset to check the chemical space of this subset and the rest of the compounds in the dataset using a principal component analysis (PCA). The PCA was computed using the SIMCA-P program (Version 10.0. Umetrics, Umeå,  Sweden). It was observed from the scatter plot (Figure 3) of the first two PCs (R 2 = 0.69, Q 2 = 0.59) that there is a small cluster located in the lower left part of the scatter plot and inspection of these compounds reveals that there is no structural similarity within this cluster. In addition, the scatter plot suggests that the subset (n = 971) clearly reflects the whole dataset. This indicates that this subset could be used for validation (as inactive class) of the virtual screening models. This step certainly improves the computational efficiency of the various model validations. Furthermore, the distribution of active and inactive compounds in the projected space was analyzed and the results suggest that, based on the physicochemical properties (2D) of both classes, there was no significant discrimination found between the two classes. However, inactive compounds slightly have a large number of heavy atoms, a larger topological surface area and a higher number of largest chains compare to the active compounds (see the scatter plots provided in the supplementary material, Figure S3, Table S1). In terms of hydrogen bonding capability, a greater number of acceptors were observed in the active group compared to the inactive group, which have a greater number of hydrogen bond donors. This indication is consistent with the common structural features of RNase H active site inhibitors as reported previously [15,17].

Pharmacophore models of RNase H Inhibitor
For the development of a pharmacophore, we have considered 19 diverse inhibitors having pIC 50 values from 5.0 to 8.6 against RNase H activity. Of the 19 compounds in the dataset, four were x-ray crystal bound inhibitors. The list of compounds used for the pharmacophore model is provided in the supplementary material ( Figure S4). The Pharmacophore models were generated with two scenarios; in scenario 1, all 19 compounds were used for model building (default). In scenario 2, we further impose that the x-ray crystal bound ligands must match in the final hypothesis. The scenario 1 resulted in two hypotheses, which contains AAAR (hypothesis-1, survival score: 2.54, selectivity: 0.99 and matches: 11) and AARR (hypothesis-2, survival score: 2.40, selectivity: 1.12, matches: 11) features (A: Hydrogen bond acceptor, R: Ring). A schematic representation of the pharmacophore is shown in Figure 4. Scenario 2 also resulted in a AAAR feature, which is very similar to hypothesis 1 of scenario 1, except the inter-site distance. The inter-site distance measures the distance that separates the feature on the molecule from the centroid of the hypothesis feature. The hypotheses built from this study are in good agreement with previously established mechanism of RNase H inhibition, as most of the inhibitors chelates with magnesium ions in the active site. From the hypothesis it is observed that the  distance between HBA1 and HBA3 was 4.5 Å and distance between HBA2 with HBA1 and HBA2 is 2.6 Å . This observation is in good agreement with previously reported distances of Nhydroxyimides chelated with metals, three of the two oxygen atoms in N-hydroxyimides is positioned with divalent metal ions at a distance of 4 Å (i.e., the distance between two metal ions) and each metal ions interact with oxygen atoms distance 2 Å ) [22] ( Figure 4). Similar results were reported for hydroxytropolones chelates with Mg 2+ ions and the observed distance between two Mg 2+ found to be 3.7 Å [59]. Although the compounds used for model building have diverse chemical scaffolds, the resulting hypothesis still had similar distance pattern that favors the metal chelation in the active site.

Enrichment estimation from docking experiments
Reproducing the crystallographically observed conformation of the ligand is a minimum requirement to determine whether a docking setup is applicable to a given system. A receptor model obtained from Swiss-model was used for all docking experiments. Initially the N-hydroxy quinazolinedinone (NHQD) was prepared as described in the ligand preparation section and docked using the standard precision mode (SP) into the active site. Subsequently we compared the conformation and position with the bound ligand conformation measured in terms of the root-mean-squaredeviation (RMSD). The measured RMSD of the docking pose was 0.23 Å and we observed similar interaction with the active site residues as the bound conformation ( Figure 1B); for instance, HIS539 and Mg 2+ (2) interact with one of the three oxygen atoms of NHQD and the neighboring two oxygen atoms interact with the other Mg 2+ (1) ion and bound water molecules. It has previously been reported that the bound water molecules at the active site of RNase H plays a crucial role in RNA:DNA duplex digestion. It initiates the hydrolysis process through deproronation of water by Mg 2+ (1) and subsequently a OH 2 ion attacks the phosphate of the RNA strand and leads to breaking of the phosphodiester bond [16,17]. Subsequently, both sets of known active compounds, namely ''subset_active_lit'' and ''subset_ actives'' were docked in the same manner and analyzed for its hydrogen-bonding (HB) network with residues, including water and metal. The results indicate that more than 56% of the inhibitors from both sets form a HB network with H539 and 40% of the inhibitors interact and form a HB network with two water molecules namely H 2 O_17 and H 2 O_24 in the active site. Compounds such as N-hydroxyl quinazolinedinone, naphthyridine, pryrimidinol analogues, capravirine (pIC50 = 8.07), PD126338 and Illimaquinone (pIC50 = 5.26) were found to strongly interact with magnesium ions in the RNase H active site ( Figure 5).
As mentioned in the method section, in the present study the docking experiments were used to build virtual screening models, which will help in the identification of new compounds for RNase H inhibition. For validation purpose, the compounds in the set ''subset_inactive'' were also docked and scored. In this context, the docking scores (GlideScore) of both sets of active compounds were added into ''subset_inactive'', and the enrichment factor (Eq. 5) was calculated using an in-build script in the Maestro module of the Schrödinger package. The enrichment factor is calculated using the number of known actives screened from inactive compounds at different % of database screening for various docking scenarios. A summary of the enrichment estimation is provided in Table 1. The receiver operator characteristic (ROC) was used to measure the effectiveness of the screening. From the enrichment analysis (Table 1), it is clear that all the docking scenarios (1-12) perform remarkably well with respect to finding actives from compounds that do not bind in the active site. In general, the enrichment effectiveness of the docking scenario score from SP (Standard Precision) performs reasonably well (ROC.0.80-0.91) compared to extra precision or high-throughput docking modes. We note that it is not completely appropriate to directly compare different docking modes which have been developed for different screening situation in drug design purpose e.g. the extra precision (XP) mode is computationally expensive (,4 days for 1000 molecules using 4 CPU) and represents a significantly strict scoring procedure compared to SP. Models obtained from models 4-6 and 10-12 performed slightly better than models 1-3 and 7-9. Some models showed early and late enrichment, for instance, model 8 from ''subset_actives_lit'', which found 53% of actives at 2% of screening (early enrichment) and 100% of actives at 77% screen. On the other hand, model 2 from subset_active found 30% of actives at 2% of screening and 100% of actives at 58% of screen (late enrichment). In addition, both active datasets were merged and the EF was calculated. The obtained merged models (13)(14)(15)(16)(17)(18) were found to perform very similar to models 1-12 (see Table 1). From the enrichment analysis, one may chose a scenario that finds a considerable number of actives at early screening from large database.

Machine learning classification models
Recently, machine learning (ML) methods have frequently begun to be used as virtual screening method, because these methods are able to relatively easy handle large datasets for building models to be applied to new predictions. In addition, ML methods also quickly find chemical patterns within the activity classes. Here, we developed various ML models for RNase H inhibition and non-inhibition using physicochemical properties (2D descriptors) and substructure fingerprints together with the BestFirst and the GeneticSearch algorithms as attribution selection methods. For classification, all active compounds (n = 741) and the compounds in ''subset_inactives'' were merged to build ML models. In total 18 models were developed to check which combination of descriptor and method performs better for classification ( Table 2). The quality of the models was evaluated in terms of MCC and G-Mean of Test set prediction. Overall, models developed from random forest with descriptors and fingerprint performs better than other methods such as SVM or KNN. The models developed from other methods were efficient to predict non-inhibitors (.70%) compared to inhibitors (,70%). It is highly important to have models that are able to correctly predict both classes in a reasonably balanced manner, and not only correctly predicts one of the classes with high accuracy. The random forest model (4) using 111 descriptors predict 71% of inhibitors and 73% of non-inhibitors correctly with reasonably good coefficient (MCC = 0.44), and the quality is also reflected in the high G-mean score (0.72). In addition, if the random forest method in combination with the GeneticSearch algorithm on fingerprints (include MACCS and sub-structures) was used for model building (10), this resulted in a slightly reduced sensitivity and no change in the specificity. Overall the accuracy of the models (13)(14)(15)(16)(17)(18) developed from combinations of fingerprint and descriptors did not change significantly (MCC = 0.43 and accuracy = 72%) compared to models developed only with descriptors (MCC = 0.44 and accuracy = 72%). However, the correct prediction [16] of inhibitors increased from 71 to 73% and noninhibitor prediction slightly reduced from 73 to 71%. We achieved the best ML models with random forest in combination with 2Ddescriptors.
Han et al. [25] reported classification models using a decision tree method with approximately 99768 compounds, in which 770 compounds were inhibitors and 98998 compounds were noninhibitors. Their training set showed an accuracy of 99% with significantly good sensitivity (83%) and specificity (99%). However, a ten-fold cross validation correctly predicts only 57% of the inhibitors, which is nearly the random prediction. This poor prediction might be due to an imbalanced distribution of active and inactive compounds in the dataset as this is a very common issue in QSAR/classification modeling with HTS dataset [60].

Validation of queries from similarity search
The rapid overlay of chemical structure (ROCS) tool is frequently being used for similarity based structure search in lead finding/optimization and it is relevant to validate which query molecule captures the majority of inhibitors shape and pharmacophoric information which discriminates from non-inhibitors. To the best of our knowledge, this is the first time that a large set of compounds has been used as query molecule for validation. The best query molecules could be used for virtual screening to identify chemically diverse RNase H inhibitors. To do this, we used the ''subset_actives_lit'' and ''subset_active'' datasets as query molecules and validated against each dataset using ''subset_inactive'' as decay dataset. For example, each one of the ''subset_actives_lit'' dataset compounds was used as a query molecule and checked if a particular inhibitor's shape and chemical pattern (ColorCombo) are able to find ''subset_actives'' (n = 20) from inactive compounds (n = 971). The enrichment factor and ROC were calculated (Table 3, Figure 6). It is clear from Table 3 that the quality of validation of each query molecule show a mixed performance, for example, some query molecules such as Nevirapine, DHBNH, PD126338, Pyrimidinol 6, CID776870 and CID 5340811 have good early enrichment, but their overall enrichment in screening is found to be poor, which is reflected from the low ROC.   Compounds such as Efavirenz, Naphthyridine-7, Hydroisoquninoline-2/20, Pyrimidinol 4, NSC727447 and Capravirine showed good enrichment throughout the screening process and the ROC for these query molecules were significantly better (0.65-0.76). Interestingly, Illimaquinone shows poor performance in the beginning, but after 2% of the database screening, it seems to be very good, which is reflected in the good ROC value (0.8).
Comparison of the best query molecules in terms of their shape and pharmacophoric features is shown in Figure 7. Looking at their color features, many of them have good agreement with hypothesis 1 (AAAR) derived from the pharmacophore modeling. Very recently, a compound DHBNH has been used as a query molecule for screening of the NCI database [26] and successfully found 4 micro molar inhibitors (2-22 mM). In the present similarity validation, DHBNH has also been used as query molecule, and shown good early enrichment. However the overall performance is nearly random (ROC = 0.5). From the results discussed from the ROCS validation, one can use the ROCS method not only to find similar compounds, but after thorough validation with a diverse set molecules, the query can potentially be used to find diverse molecules that will show same bioactivity.

Fingerprints for ligands and proteins (FLAP) models
The FLAP program was used for molecular interaction fields (MIFs)-based pharmacophore virtual screening. Here the sub-set_active_lit dataset was used to develop pharmacophore and subsequently applied for validation. FLAP generates possible ensembles of conformations for each active molecule and keeps the pharmacophorically similar conformers. Subsequently, the conformers were aligned based on the pruned tree search. The aligned models were used to generate a pharmacophorically psuedomolecule, which consist of pharmacophoric points. The quality of an aligned model was assessed using the fitness score and AUC value. The best 5 aligned models were analyzed. Each model uses different sets of molecules for alignment and the fitness score ranges were from 0.85 to 0.95 (Table 4, Figure 8). The best enrichment was found from model 3 using the Glob-Prod score with an AUC of 0.73 and followed by models 1, 4, 5 which are based on Glob-Prod and H (denotes shape) similarity parameters with an AUC of 0.69. However, the quality of all models was improved significantly when removing compounds that have less global similarity score (also called the S-score) in every model, for instance, model 1 was refined by removing 11 actives (having S-score less than 0.8) from the alignment. The resulting pharmacophore model shows relatively good enrichment using Glob-Prod parameters and yielding an AUC of 0.76. In a similar manner, all the remaining models were refined and showed to possess good enrichment using various similarity parameters such as N1 (hydrogen bond acceptors), O (hydrogen bond donor) and H (shape) score.  Comparison of the developed models Comparison of the efficacy and efficiency of the developed models needs to be done with caution since the methods used in this study are conceptually different and possesses their own pros and cons. However, from a virtual screening point of view the models were compared in terms of how quickly they were able to find known inhibitors from a large set of non-inhibitors. Among the methods used to efficiently identify new RNase H inhibitors, the best performances were achieved with structure-based models such as docking (83-91%), followed by FLAP (76%), similarity search (74-77%), and machine learning e.g., random forest (72%). A summary of the time scale for new predictions based on the use of the different methods is provided in Table 5. The docking experiments produced the models (2,5,8,11,14,17) with the highest performance using standard precision (SP). Considering the methods efficiency for new prediction, ligand based methods such as similarity search and machine learning models are very efficient, for instance setting up the machine learning based models for new prediction would not take more than an hour for approximately 20000 compounds (this process includes dataset set preprocess, descriptors calculation and new prediction). However, the efficacy of ML models is slightly less than the other methods. Compounds such as NSC727447, Hydroxyisoquninoline_2/20, Naphthyridine_7 and pyrimidinol_4/5 could be used as reference molecules for new inhibitor search. From the ROCS experiment, it was noted that these molecules potentially captures most of the available known inhibitor structural information (ROC = 66-77%) and observed very good early enrichment in the screening experiment. From an efficacy point of view in virtual screening, ROC and ML methods could be the methods of choice. Alternatively, docking experiment with standard precision setup (in Glide) could be used for high enrichment in the virtual screening process. In addition docking experiments not only take advantage over the other methods in terms of high enrichment, but also proposes energetically favorable poses of the ligands in the binding pocket of a protein and thereby predicts key interactions. However, from an efficiency point of view docking predictions might be slow for large chemical database searches. Although the FLAP model predicts reasonably well (ROC = 0.76), the efficiency could be slow as compared to the other models. It is important to analyze if the ligand based pharmacophore model and the best ranked docking pose of the validation compounds suggest similar biological interactions or not. In order to compare this, the first ranked docking pose (XP scoring) for all validation compounds were superimposed with hypothesis 1 (AAAR). The result suggests that the information derived from the ligand-based pharmacophore model (AAAR) merely reflects the complementary receptor as shown in Figure 9. As mentioned previously, hydrogen bonding acceptors play an essential role in binding with metal chelation.

Conclusions
In the present study we have developed virtual screening models from various conceptually different computational methods (structure and ligand based) to efficiently identify HIV-1 RT associated RNase Inhibitors. Models based on the docking experiments are superior to other ligand-based models. Considering efficacy and efficiency, models trained using random forest with combination of 2D-descriptors and structural fingerprints are highly efficient methods, however, docking experiments correctly predicts more than 80% and are relatively less time consuming, and these could thus be useful for new prediction. In addition protein-ligand key interaction information from docking could be useful for lead optimization e.g. after virtual screening.
Moreover, the hypothesis generated from pharmacophore modeling and molecular interaction fields from the FLAP experiment illustrates the importance of hydrogen bonding acceptors and hydrophobic features. These features are in good agreement with the previously reported x-ray structures of HIV-1 RT associated RNase H. The pharmacophore model suggests that the inter-distance between hydrogen bond acceptors play a key role in inhibition of RNase H domain through metal chelation. The observed distances between HB acceptors are consistent with previous experimentally measured distances. As discussed, the docking procedure remains the method of choice for virtual screening; however, random forest and similarity search method could also be used alternatively to efficiently screen large database in order to establish new RNase H inhibitors for highly active retroviral therapy.  File S1 Structure and activity for the subsets used in this study are provided separately as ''Subset.sdf''. (ZIP)