UFSRAT: Ultra-Fast Shape Recognition with Atom Types –The Discovery of Novel Bioactive Small Molecular Scaffolds for FKBP12 and 11βHSD1

Motivation Using molecular similarity to discover bioactive small molecules with novel chemical scaffolds can be computationally demanding. We describe Ultra-fast Shape Recognition with Atom Types (UFSRAT), an efficient algorithm that considers both the 3D distribution (shape) and electrostatics of atoms to score and retrieve molecules capable of making similar interactions to those of the supplied query. Results Computational optimization and pre-calculation of molecular descriptors enables a query molecule to be run against a database containing 3.8 million molecules and results returned in under 10 seconds on modest hardware. UFSRAT has been used in pipelines to identify bioactive molecules for two clinically relevant drug targets; FK506-Binding Protein 12 and 11β-hydroxysteroid dehydrogenase type 1. In the case of FK506-Binding Protein 12, UFSRAT was used as the first step in a structure-based virtual screening pipeline, yielding many actives, of which the most active shows a KD, app of 281 µM and contains a substructure present in the query compound. Success was also achieved running solely the UFSRAT technique to identify new actives for 11β-hydroxysteroid dehydrogenase type 1, for which the most active displays an IC50 of 67 nM in a cell based assay and contains a substructure radically different to the query. This demonstrates the valuable ability of the UFSRAT algorithm to perform scaffold hops. Availability and Implementation A web-based implementation of the algorithm is freely available at http://opus.bch.ed.ac.uk/ufsrat/.


Introduction
The concept of molecular similarity has been exploited in nearly all chemical fields and has been used to great effect in the pharmaceutical industry to reduce the massive cost of drug development [1][2][3]. When molecular similarity is employed in ligand-based virtual screening it offers the ability to carry out searches for actives where little is known about the drug receptor, only molecules which bind to it [4][5][6][7][8]. Structurally similar molecules can exhibit similar biological properties and may therefore bind to receptors, making the same or equivalent interactions as the native ligand [6,9]. Molecular similarity and more specifically, scaffold hopping also provides a route to 'rescue' problematic drug leads which may well be inhibitors of a protein, but are unsuitable for further development due to problems with pharmacology, pharmacokinetics or patent issues [3,10]. Scaffold hopping describes the discovery of a compound with the same or similar bioactivity as the query compound but with a different core molecular structure. Successful scaffold hopping methodologies commonly describe the virtual compound in a way that encodes both the 3D shape of the molecule and the electrostatic and hydrophobic properties. This is key to successful lead discovery because electrostatic and van der Waals interactions are very sensitive to bond geometry and distance. There is of course a direct correlation between the levels of detail encoded in molecular descriptors or force-field based approaches and computational resources. It is essential to develop algorithms that can succinctly capture the essential molecular features and then search very large databases in a computationally efficient manner. We have developed the idea of capturing molecular shape using parameters determined from the interatomic distance distributions first proposed by Ballester and Richards [11,12] and incorporate these pre-calculated molecular descriptors into a searchable database of available compounds [13].
In this paper we describe the use of our UFSRAT algorithm (an expansion of the validated [14][15][16][17][18][19] USR technique) in virtual screening pipelines to identify inhibitors of two unrelated enzymes; FK506-Binding Protein 12 (FKBP12) and 11β-hydroxysteroid dehydrogenase type 1 (11β-HSD1). FKBP12 is a peptidyl-prolyl isomerase, catalysing protein folding [20][21][22] and is a therapeutic target for Parkinson's and Alzheimer's disease [23]. The enzyme 11β-HSD1 catalyses the intracellular biosynthesis of the active glucocorticoid steroid hormone cortisol which plays a central role in glucose homeostasis and the inflammatory response [24,25]. Inhibitors of 11β-HSD1 have been investigated for targeting cardiometabolic diseases such as type-2 diabetes, as well as glaucoma, osteoporosis and Alzheimer's disease. Cellular and direct binding assays show that UFSRAT successfully identified highly active non-steroid inhibitors with nanomolar activity.

Methods
Computational Methods: Ultra fast shape recognition with atom types Applying the Ultra Fast Shape Recognition with Atom Types (UFSRAT) technique to a query molecule and a candidate molecule returns a measure of similarity between the two. This process consists of three steps: first shape and atom property descriptors are calculated for each molecule; second, the descriptors are compared using a scoring function and finally similar molecules are ranked by score. Ballester and Richards outlined Ultrafast Shape Recognition (USR) [11,12] an algorithm that can be used to assess the shape similarity between two molecules. With this approach, no distinction is made between different types of atom. UFSRAT has been developed using USR as the base concept but additionally encodes hydrophobic and electrostatic information; features of the molecule that are important in molecular recognition.
Calculating descriptors. UFSRAT calculates 4 descriptor sets for each virtual molecule; each set encodes different features of the molecule. The distributions used to generate UFSRAT descriptors are: All atoms (shape), hydrophobic atoms, hydrogen bond acceptor atoms and hydrogen bond donor atoms.
Determining which atoms within a molecule should be considered for each distribution requires atom type information to be calculated. Atom typing can be described as applying a user definable atomic mask. This is achieved using the same implementation present in the high throughput virtual screening code LIDAEUS [26,27]. Fig. 1 illustrates the generation of UFSRAT distributions whilst Fig. 2 shows how the descriptors are calculated. 12 descriptors are calculated for each of the 4 distributions resulting in 48 descriptors for each molecule. This is an additional 36 compared to USR which only uses an all atom distribution. The 12 descriptors for each set are calculated from the geometric atom distributions. Fig. 2 illustrates a simplified all atom distribution descriptor generation process applied to a 2D small molecule (note that hydrogens are ignored). Calculating the 12 values takes the following form; (i) Atomic coordinates are used to define the centre of the volume occupied by the molecule as point 1 (P1). (ii) A distribution of Euclidean distances of all atoms to P1 is generated. (iii) From this   Comparing molecular descriptors. The 48 geometric distribution descriptors are pre-calculated for each candidate molecule in the database and stored as vectors M c . M q represents the vector for the query molecule. The USFRAT scoring function is used to generate a similarity score, S qc , between the query molecule and each candidate molecule in the database. This is illustrated in Fig. 3. The similarity score S qc is a single numerical metric pertaining to the predicted similarity of the molecules. As candidate molecules are tested against the query, a sorted list of the top matches is kept and returned as the result upon completion.
Ranking compounds by similarity. The similarity scoring function S qc is an extension of the method used by USR, calculating a weighted difference for each distribution and scaling the result between 0 and 1. UFSRAT is therefore typically used to screen a large database of potential candidate molecules against a query resulting in a ranked list of arbitrary length such as the top 500 similars.
UFSRAT-execution. A high speed version UFSRAT is currently deployed on the University of Edinburgh's EDULISS system [13]. EDULISS is a web based molecular database mining tool. Containing 3.8 million unique compounds, the system employs a number of methods for similarity searching. The inclusion of UFSRAT into this system as a similarity method has enabled the pre calculation of UFSRAT descriptors for each of the 3.8 million molecules. Storage of pre-calculated descriptors is achieved using one binary data file. Storing all required information along with a unique identifier for a molecule requires 200 bytes. This is made up of 8 bytes for the identifier, and 192 bytes for the 48 descriptor values. UFSRAT therefore requires 730 megabytes of data to represent 3.8 million compounds in a condensed form suitable for similarity searching. Pre-calculating the descriptors in this way enables UFSRAT to read in the query molecule supplied by the user and then only ever calculate descriptors for the query. The scoring is then performed on the pre-calculated descriptors along with the newly generated query molecule descriptors. Upon completion, unique identifiers are used to retrieve top scoring molecules from the database and return them to the user in the form of a multi-molecule MDL-SD file. This approach results in a run time of around 8 seconds to query a molecule against the whole database of candidate molecules and supply results to the user using a modest 2010 linux-based server running on 1 processor. Linear multiprocessor speedup has also been achieved by removing disk operations from timings and parallelization using the OpenMP shared memory paradigm [28] on up to 8 processors.
The FKBP12 virtual screening pipeline UFSRAT was used to select compounds similar to known inhibitors of FKBP12 from the EDULISS database of small molecules. Three well characterised FKBP12 inhibitors were used to generate UFSRAT queries. The results of running these queries with UFSRAT acted as the starting point in a structure-based virtual screening pipeline (See Fig. 4). Around 50% of atoms contained within the high affinity macrocyclic immunosuppressants rapamycin and FK506 bind to FKBP12; the remainder exert a pharmacological effect by interacting with a second protein; mTor and calcineurin respectively [29]. Both rapamycin and FK506 share the same FKBP12 binding motif that occupies the majority of the active site making four hydrogen bonds. A high resolution crystal structure of FKBP12 in complex with rapamycin (2DG3.pdb), was used to select the buried fragment of rapamycin made up of atoms within 3.5 Å of FKBP12 [30], defining the query molecule Q1 (See Fig. 4). Two additional queries were selected from compounds from an NMR screen: 2-[(4-methylphenyl)sulfanyl]-1-(morpholin-4-yl)ethan-1-one, Q2, and (2-[(4-methylbenzene)-sulfonyl]-1-(morpholin-4-yl) ethan-1-one), Q3, (Fig. 4) [31]. The top 500 hits from UFSRAT were saved from each query (1500 total, 0.04% of the database) and docked into template 2DG3.pdb using the program LIDAEUS: grid spacing 0.6 Å; docking resolution parameter 0.04 Å [26,32]. Ligands matching the shape and atom types of the site-points were ranked according to scores describing the enthalpy and buriedness of the interaction and then filtered for drug-likeness using Lipinski's criteria and descriptors pre-calculated in the EDULISS data-base [33]. Manual inspection using PyMol identified ligands with predicted poses occupying the base of the pocket and made at least one hydrogen bond with Tyr82 or Ile56; hot spot residues in the active site of FKBP12. Hits containing the same scaffold as the query were not selected for purchase.

The 11β-HSD1 virtual screening pipeline
Carbenoxolone, a potent bi-directional inhibitor of 11β-HSD1, was used to generate a UFSRAT query [34] (see Fig. 5). The structure of carbenoxolone in complex with the enzyme is available in the protein data bank with entry ID 2BEL. Candidate molecules were selected from the EDULISS database, the FKBP12 workflow outlined in Fig. 4 was followed with the exception of the docking step.

Testing virtual hits
All reagents were of the highest grade available. Compounds were purchased from: Chembridge Corp., USA; InterBioScreen, Russia; Sigma Aldrich, UK; Specs, The Netherlands and TimTec, USA. Recombinant human FKBP12 was expressed and purified as described in [35]. The purification of recombinant human 11β-HSD1 is described by Adie [36]. Visualization of 3D structures was performed using PyMOL [37] (open source, DeLano Scientific LLC).
Thermal denaturation fluorescence (TDF). A fluorescence based thermal denaturation assay was used to detect binding to FKBP12 essentially as described in the study of Lo [38]. Final conditions were: 50 mM ammonium acetate, pH 7.0; 10 μM FKBP12; 250 μM screening compound and 5x SYPRO orange dye (Invitrogen). The compound used for Q2, 2-[(4-methylphenyl)sulfanyl]-1-(morpholin-4-yl)ethan-1-one, acted as a positive control (K d = 15 μM, 27°C) [31]. Protein was melted over the temperature range 20 to 80°C; 0.5°C increment, 30s hold. Compounds were screened for fluorescence in the absence of protein; all samples were solvent matched and repeated in triplicate on 3 separate experiments. Apparent equilibrium dissociation constants (K d app ) for the interaction between the protein and ligands were calculated from the change in mid-point melting temperature of the protein (T m ) on addition of ligand and transformed to 25°C using a benchmark value for the enthalpy of ligand binding of -5 kcal•mol -1 [39].
Electrospray ionization mass spectrometry. Electrospray-ionization mass spectrometry (ESI-MS) was used to detect binding to FKBP12. Data was collected on a LCQ TM DECA iontrap instrument (ThermoQuest) run in the positive ion mode. The analyte was introduced by direct infusion and the following operating parameters were employed: nitrogen flow 100units, analyte flow 6 μl•min -1 , capillary voltage 3.5 kV, cone voltage 25 V and capillary temperature 50°C. Spectrum were averaged over 300 scans (2s/ scan) and processed using the Bioworks Browser V 3.0 software package and ProMass 2.5 (ThermoScientific). The analyte solution consisted of 100 μM ligand, 5 μM FKBP12 in 10% methanol (v/v), 10 mM ammonium acetate, pH 6.8 incubated for 10 minutes at room temperature. K d app for the interaction between the protein and ligand were calculated from the ratio of the intensity of the ion signal from free protein and protein in complex with ligand using the method of Tjernberg [40].
Testing for inhibition of 11β-HSD1 reductase activity in cells. Candidate compounds were tested for inhibition of the conversion of cortisone to cortisol by 11β-HSD1 expressed in HEK-293 cells stably transfected with the full length human hsd11b1 gene as previously described [41]. An established scintillation proximity assay (SPA) was carried out in a 96-well plate format, with appropriate controls [42]. Experiments were set up in triplicate with 8 inhibitor concentration points per compound, spanning 3.5 orders of magnitude. Final solution conditions: 40 nM [ 3 H]-cortisone; 50 mM Tris pH 8.0, 50 mM NaCl; 1% DMSO. The anti-cortisol antibody was purchased from HyTest, Finland; radiometric quantification was carried out on a Perkin Elmer TopCount instrument. The percentage inhibition was determined relative to a non-inhibited control and the IC50 determined from a plot of percentage inhibition against compound concentration. Data were fitted global nonlinear regression using Kaleidagraph V4.0 (Synergy Software).
Testing for inhibition of 11β-HSD1 dehydrogenase activity. Candidate compounds were tested for inhibition of the conversion of cortisol to cortisone by recombinant h11β-HSD1. The dehydrogenation of cortisol is accompanied by the reduction of NADP + to NADPH; this was followed spectrophotometrically and scaled to a NADPH calibration curve. The excitation and emission wavelengths were 340 and 458 nm. (SpectraMax M5, Molecular Devices). The experimental conditions were 500 μM NADP, 1 μM 11β-HSD1, 200 μM cortisol 50 mM Tris, pH 7.7, 50 mM NaCl; 25°C. Reactions were carried out in sextuplicate. K i app were derived from a fit of the rate of the reaction as a function of inhibitor concentration; Kaleidagraph, V4.0 [43] Performance comparison of UFSRAT with USR using the DUD-E dataset. The Directory of Useful Decoys: Extended (DUD-E) dataset [44] was used to compare the enrichment performance of UFSRAT with USR. We also included the popular topological fingerprint ECFP4 [45] as implemented in the RDKit toolkit [46]. Using 102 crystallographic ligands as query molecules for targets within the DUD-E dataset, collections of molecules containing known actives for each target along with random small molecules were run using each similarity method. Statistics were then gathered on recall of known actives within the top scoring 0.5%, 1%, 2%, and 5% of results. An enrichment score was then given to each method against each target. An enrichment score of 1 denoting the number of hits found would be expected by random picking of compounds. A score of 2 denoting double the expected recall over random picking and so on.

Inhibitors of FKBP12
Of the 10 compounds tested, 4 compounds were hits in both the thermal denaturation fluorescence and ESI-MS screening assays. A compound was defined as a hit if it raised the mid-point melting temperature of the protein by at least 0.5°C and in ESI-MS if hits were seen to fly as an ion consistent with the mass of the protein-ligand complex. Structures of compounds 1 to 4 are available in S3 Fig. and affinity data is shown in Table 1. ESI-MS spectra of FKBP12 ligand complex are shown in Fig. 6.

Inhibitors of 11β-HSD1
Out of 26 compounds tested, 4 inhibited in cells and with recombinant human 11β-HSD1. IC 50 values for the SPA cell-based reductase assay and K i app for the fluorescent recombinant protein dehydrogenase assay are shown in Table 2. Structures of compounds 5-8 are available in S4 Fig. A compound was classified as a hit if it inhibited 11β-HSD1 by at least 40% in the cell-based assay with a concentration of 100 μM inhibitor and 25% inhibition in the recombinant protein assay. The fluorescence assay is less sensitive to inhibition due to the partial presence of carbenoxolone in the enzyme binding site as an artifact of the purification procedure. This is reflected in lower inhibition by the same compounds in this assay, but may also be due to differences in the ability of each compound to inhibit either the 'reductase' or 'dehydrogenase' forms of the enzyme.

Performance comparison of UFSRAT with USR using the DUD-E dataset
Comparing UFSRAT against USR in the recall of active molecules within the top 0.5%, 1%, 2% and 5% of scores shows that UFSRAT consistently outperforms USR when enrichment is averaged across the DUD-E targets. At the 0.5% level, average enrichments are 1.99 (USR) and 3.83 (UFSRAT), UFSRAT providing greater enrichment over USR for 75 out of 102 protein targets. At the 1% level, average enrichments are 1.70 (USR) and 2.87 (UFSRAT), UFSRAT providing greater enrichment over USR for 70 out of 102 protein targets. At the 2% level, average enrichments are 1.41 (USR) and 2.27 (UFSRAT), UFSRAT providing greater enrichment over USR for 67 out of 102 protein targets. Finally, at the 5% level, average enrichments are 1.09 (USR) and 1.22 (UFSRAT), UFSRAT providing greater enrichment over USR for 68 out of 102 protein targets. For individual proteins in the DUD-E dataset, the largest difference in enrichment when UFSRAT is outperformed is 5.6 for USR and 2.1 for UFSRAT, occurring at the 0.5% level. The largest difference in enrichment when UFSRAT outperforms USR is 33.7 for UFSRAT and 13.8 for USR at the 0.5% level. It is interesting to note that whilst UFSRAT Molecules 1, 2 and 3 were retrieved using query molecules Q1, Q2 and Q3 and the UFSRAT technique, screened against the EDULISS database.
Binding of the top hits was tested using thermal denaturation (TDI) and mass spectrometry (ESI-MS).  outperformed USR, both 3D techniques were greatly outperformed by ECFP4 in this test. ECFP4 displaying on average, 8.9-fold enrichment at the 0.5% level, 11.9-fold at the 1% level, 7.7-fold at the 2% level, and 4-fold at the 5% level. However, whilst ECFP4 performs superbly in this test, its performance can be accredited to the high degree of common substructures shared by DUD-E actives, and absent amongst decoys. The nature by which ECFP and other fingerprint methods operate are less amenable than 3D techniques to radical scaffold hopping to the degree demonstrated by UFSRAT in the discovery 11β-HSD1 actives. Exhaustive results for the recall of actives are available in the supporting data accompanying this report (see S1 Fig. and S1-S6 Tables).

Discussion
The nature of UFSRAT as a ligand-based molecular similarity technique enables ligand discovery whilst bypassing the need for any detailed structural information on the target. The 3D conformation of a substrate or inhibitor indirectly encodes 3D and electrostatic properties of the active form of the receptor. UFSRAT has an advantage over 1D and 2D-based molecular similarity techniques, describing a complimentary shape to the binding pocket, something which is not captured using 1D or 2D techniques. UFSRAT is also not prescriptive of bond order, specific sub-structures or molecular orientation but does take into account the electropotential and therefore has the ability to discover bioisosteres, strengthening scaffold hopping abilities as already demonstrated for 11β-HSD1. Defining the similarity between two molecules in an objective way is a non-trivial problem. Validating the success of a similarity program is difficult due to the absence of any absolute metrics of similarity. However, performance comparison using the DUD-E dataset clearly shows enhanced enrichment compared to the original USR method.
Comparison of query and candidate molecules by UFSRAT is sensitive to the conformation of the molecules in question and one might therefore infer that the program would have most success in biological assays if the receptor bound conformation of the query is known. In this study UFSRAT has selected biologically active compounds from a query structure with known conformation derived from a protein-ligand complex and a query where we have generated a low energy conformer from a 2D representation. The pre-calculation of descriptors by UFSRAT means that it is possible to place descriptors for more than one low energy conformation in the database of candidate compounds and also test multiple conformations of a query, thus reducing the chances of missing a hit because virtual molecules were not in biologically relevant conformations.
In this study UFSRAT retrieved the highest affinity hits for 11β-HSD1. For this target the correct biological conformation of the query carbenoxolone was available from crystallographic data. Interestingly, the top virtual hits were non-steroidal but did show similarity to known 11β-HSD1 inhibitors demonstrating the ability of UFSRAT to scaffold hop. By eye, carbenoxolone and all identified 11β-HSD1 active molecules are radically different however, a key pattern of atoms involved in hydrogen bonding is preserved when examined in 3D (see S2 Fig.). Also, a survey of 11β-HSD1 non-steroidal inhibitors has shown that the majority share a central group or linker containing an atom that can act as a hydrogen bond acceptor, typically oxygen, flanked on either side by lipophilic groups [41]. This molecular arrangement is observed with all the hit compounds in this study. Compound 6 contains a linker with a 1,5-sulphanyl 1H-tetrazole moiety. A series of bioactive compounds containing the same 1,5-sulphanyl 1H-tetrazole linker have been reported [41]. Compounds 6, 7 and 8 all share the same linker between the lipophilic groups. The highest affinity hit is compound 8 which showed low nanomolar affinity IC 50 in a cell based assay.
Compounds selected by UFSRAT that bind to FKBP12 with mid-micromolar affinity are similar to the query compounds in that they share two ring systems joined by a linker containing a motif that has the potential to form hydrogen-bonding interactions with hot-spot residues in the active site. The affinity of the query compounds Q2 (15 μM) and Q3 (95 μM) (see Fig. 4) for FKBP12 show that small changes to the linker have a significant affect on affinity. Stebbins reports a series of analogues of Q1 (Fig. 4); substituting-S-with-S0 2 -, -0-or-CH 2all result in a reduction in affinity [31]. The biophysical techniques used to assess binding all have the advantage of detecting low affinity ligands that have low solubility in solution. However, as measurements are carried close to 50°C it is likely, in the majority of cases, that the affinity would be higher at 25°C.

Conclusion
UFSRAT has been shown to be an efficient algorithm for selecting bioactive molecules from large databases. Compounds have been identified that bind to two clinically relevant drug targets FKBP12 and 11β-HSD1. A key feature of the results is that the compounds selected by UFSRAT do not have the same scaffold as query molecule, thus demonstrating the scaffold hopping capabilities of USFRAT.  Table. DUD-E profiling of USR and UFSRAT at the 0.5% level. (DOCX) S3 Table. DUD-E profiling of USR and UFSRAT at the 1% level. (DOCX) S4 Table. DUD-E profiling of USR and UFSRAT at the 2% level. (DOCX) S5 Table. DUD-E profiling of USR and UFSRAT at the 5% level. (DOCX) S6 Table. DUD-E profiling of ECFP4 at the 0.5, 1, 2 and 5% levels.