Combining Metabolite-Based Pharmacophores with Bayesian Machine Learning Models for Mycobacterium tuberculosis Drug Discovery

Integrated computational approaches for Mycobacterium tuberculosis (Mtb) are useful to identify new molecules that could lead to future tuberculosis (TB) drugs. Our approach uses information derived from the TBCyc pathway and genome database and the Collaborative Drug Discovery TB database, combined with 3D pharmacophores and dual event Bayesian models of whole-cell activity and lack of cytotoxicity. We have prioritized a large number of molecules that may act as mimics of substrates and metabolites in the TB metabolome. We computationally searched over 200,000 commercial molecules using 66 pharmacophores based on substrates and metabolites from Mtb, with further filtering by Bayesian models. We ultimately tested 110 compounds in vitro, which yielded two compounds of interest, BAS 04912643 and BAS 00623753 (MIC of 2.5 and 5 μg/mL, respectively). These molecules were used as a starting point for hit-to-lead optimization. The most promising class proved to be the quinoxaline di-N-oxides, evidenced by transcriptional profiling to induce mRNA level perturbations most closely resembling known protonophores. One of these, SRI58, exhibited an MIC = 1.25 μg/mL versus Mtb and a CC50 in Vero cells of >40 μg/mL, while featuring fair Caco-2 A-B permeability (2.3 × 10⁻⁶ cm/s), kinetic solubility (125 μM at pH 7.4 in PBS) and mouse metabolic stability (63.6% remaining after 1 h incubation with mouse liver microsomes). Despite demonstration of how a combined bioinformatics/cheminformatics approach afforded a small molecule with promising in vitro profiles, we found that SRI58 did not exhibit quantifiable blood levels in mice.


Introduction
Learning from experience in neglected disease drug discovery is essential for increasing time and cost efficiencies. This requires that we leverage and build upon computational methods widely used in industrial drug discovery [1,2]. For example, we have previously analyzed large datasets for Mycobacterium tuberculosis (Mtb) [3][4][5][6][7][8][9][10][11][12][13][14], the causative agent of tuberculosis (TB). We have used these to build machine learning models that use single point data, dose response data [3,4], combined bioactivity and cytotoxicity data (e.g., Vero, HepG2 or other cells) [8][9][10] or combinations of the preceding [13,15]. The deliverables have been promising novel (or long-abandoned) antitubercular hits for further pursuit as well as strategies for hit-to-lead evolution and prediction of antitubercular in vivo activity in the mouse model of infection [14].
Whole-cell phenotypic high-throughput screening (HTS) against Mtb does not typically provide information on the potential target/s for the compounds and so other methods must be used for target identification [16,17]. For example, we have contributed computational methods that rely on similarity of compounds to inhibitors of known targets [17] to create TB Mobile 2 which applies a machine learning approach to predict target likelihood.
Since a small fraction of Mtb proteins are known to be modulated by approved TB drugs [7], a need exists to modulate other targets to avoid existing drug resistance mechanisms. We have focused initially on the targets that were essential to the growth and survival of Mtb [18], under in vitro and in vivo conditions [19], and ultimately declared respective lists of essential enzymes and their essential metabolites [6,7]. In an effort to discover inhibitors of 9 essential enzymes through their mimicry of the chemical structure of a given metabolite, 3D pharmacophores were used to screen over 80,000 commercial compounds. Ultimately after testing 23 candidate inhibitors or metabolite mimics (including 3 predicted inactives), 2 moderately active compounds were identified [7]. In the current study we have greatly expanded our approach to also assess targets that are in vitro but not in vivo essential. We computationally searched >206,000 molecules with 66 pharmacophores of Mtb essential metabolites or substrates and assayed 110 compounds in vitro. We have identified 3 compounds possessing whole-cell activity against Mtb. Two of the hits were further optimized in a drug discovery workflow. We demonstrate that this approach of computational metabolite mimicry is scalable to afford promising chemical entities and could be explored for other diseases and yet it is ultimately confounded by molecular properties that impact in vivo pharmacokinetics.

Small molecule information from CDD for new potential Mtb enzyme targets
Except for one of the 46 potential targets (MurE) identified (S1 Table) in our initial bioinformatics analysis (See Materials and Methods), none of the enzymes described have any small molecule inhibitors noted in the CDD Public database at the time of this study. Depending on various criteria like in vivo essentiality, whether or not X-ray crystallographic information was available in the Protein Data Bank (www.rcsb.org), a subjective interest in the constituent pathway/s, suitability of the structure for the enzyme substrate or product to facilitate mimic design (e.g. lack of charge), 20 targets were selected from the list of 46 enzymes as being of particular interest. These are TrpB, MetE, IlvD, FolK, HisC1, HsaE, End, BioF1, CobL, Ace, AccD1, SerB2, AmiD, HsaF, Rv1879, Tal, FabG, NuoD, ProA, and ArcA (bold are those encoded by a gene predicted to be in vivo essential, (S2 Table)). Reaction details including substrates and products (and their relevant SMILES strings) are provided for these 20 selected targets. As described previously [7] the TBcyc pathway database (http://tbcyc.tbdb.org/index.shtml), an Mtb specific metabolic pathway database, was used to extract this information (S2 Table). The TBcyc database was initially developed using SRI's Pathway Tools software that automatically generates a Pathway/Genome Database (PGDB) describing the genome and biochemical networks of the organism from the annotated genome sequence of Mtb [20,21].

In silico selection of putative metabolite and substrate mimics
14,733 commercial molecules were retrieved from a set of over 206,000 (from the Asinex Gold library) using the 66 pharmacophores (S1 Fig and S1 Model Files) based on enzymatic reaction substrate and product chemical structures and were suggested as potential mimics. These molecules were scored with three dual event Bayesian Mtb models (MLSMR, CB2, Kinase) [10][22][23][24][25] in Discovery Studio [4,26,27]. All compounds were imported into CDD. 110 molecules were selected for purchase given pharmacophore scores greater than 2.5 (higher scores are better), 'active' scores in all 3 dual event models, and successful visual filtering (e.g., absence of reactive functional groups) [28].
Initial pharmacophore/Bayesian model-derived hits: A) chemical structures and in vitro antitubercular activity; B) best fit of BAS04912643 to the menadione pharmacophore; C) best fit of BAS00623753 (grey) to the menadione pharmacophore; D) best fit of BAS7571651 to the indole-3-acetamide pharmacophore; E) best fit of BAS7571651 to the lipoamide shape. The pharmacophores consist of hydrogen bond acceptors (green), hydrogen bond donors (purple) and hydrophobic features (blue). The van der Waals surface was used to limit the number of compounds retrieved when screening the vendor library.

Hit exploration
We have further explored the structure-activity relationships (SAR) for the two most potent hits. Initial efforts with BAS 00623753 consisted of the synthesis of the initial hit along with 13 analogs (Table 1; details of the synthesis and characterization of all compounds may be found in the S1 Data). The alterations included: removal of the nitro group from the aroyl ring or its replacement with an electron-donating group (CH3) or other electron-withdrawing groups (F, CF3); α,α-dimethylation of the one-carbon tether between the amide nitrogen and the heteroaryl group or its homologation; and replacement of the 2-pyridyl moiety with differentially substituted pyridines or other heterocycles of the pyrazine and pyrimidine families. Their syntheses (Fig 2) were realized through the facile coupling of the aroyl chloride and amine partners in moderate to good yields. The small molecules were then assayed for their growth inhibition of Mtb. Disappointingly, the synthesized version of BAS 00623753 exhibited an MIC ≥ 50 μg/mL, as did the other analogs. The original commercial sample that demonstrated promising whole-cell efficacy was no longer available and thus an analytical comparison of the two materials was not feasible.
BAS 04912643 (1) was identified as a potential substrate mimetic of menadione (Fig 1) and demonstrated an MIC against Mtb of 2.5 μg/mL. To validate the hit, we developed an SAR for both antitubercular efficacy and cytotoxicity to model mammalian (Vero) cells through determination of the CC50 (the concentration of compound that inhibits cell growth by 50%). Structural queries of the Available Chemicals Directory (ACD) (http://accelrys.com/products/databases/sourcing/available-chemicals-directory.html), SciFinder (http://www.cas.org/products/scifinder) and eMolecules (www.emolecules.com) were performed to identify structural analogs of 1 available for purchase.
Of the commercially available analogs, only two compounds (quinoxaline di-N-oxides 2 and 3, Table 2) were subjectively viewed as sufficiently similar while also being predicted to be whole-cell active through our Bayesian models. These analogs were purchased and tested for their antitubercular activity (Table 2). Their whole-cell efficacy was confirmed experimentally. To further establish an SAR for the quinoxaline di-N-oxides, we used a concise synthetic route that consisted of heating a benzofuroxan with a 2,4-pentanedione in the presence of silica gel (Fig 2). This one-step reaction gave acceptable yields (20-80%) of desired product, though it often generated regioisomers depending on the nature of the benzoxadiazole N-oxide. The regioisomers were generally separable via flash chromatography and both isomers were tested for activity; in some instances chromatographic separation of the isomers was not achieved. This method was utilized to prepare 62 analogs that have been fully characterized via LC-MS and 1H NMR spectroscopy and tested for their MIC value against Mtb (Table 2).
A range of aliphatic groups appeared to be tolerated at R1, in contrast to an ester (52) or acid (53), for which the MIC was >40 μg/mL. A small set of substituents (H, Cl, CH3, OCH3, NO2) was examined at the 5- and 6-positions of the benzofuroxan input to afford final compounds with potencies that varied significantly depending on the other substituents on the quinoxaline. The two most potent antitubercular compounds were 50 (MIC = 0.32 μg/mL) and 12 (MIC = 0.64 μg/mL).
Eight analogs with an MIC ≤ 5.0 μg/mL, in addition to the original hit 1, were also tested for cytotoxicity to Vero cells to assess the selectivity for antimicrobial activity relative to cytotoxicity (SI = CC50/MIC) (Fig 3). Three compounds exhibited an undesirable SI < 10 (SRI12, SRI54, and SRI56). Amongst the five that met the SI criterion, SRI57 and SRI58 both demonstrated SI > 32.
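The selectivity index defined above is a simple ratio, and a minimal Python sketch may make the censored case explicit. The function name is illustrative rather than taken from the study; the SRI58 values (MIC = 1.25 μg/mL, CC50 > 40 μg/mL) are those quoted in the abstract. Because the CC50 is only a lower bound, the resulting SI is a lower bound as well.

```python
def selectivity_index(cc50_ug_ml: float, mic_ug_ml: float) -> float:
    """Return SI = CC50 / MIC; both values must be in the same units."""
    if mic_ug_ml <= 0:
        raise ValueError("MIC must be positive")
    return cc50_ug_ml / mic_ug_ml


# SRI58: MIC = 1.25 ug/mL and CC50 > 40 ug/mL (values from the abstract).
# The CC50 is a lower bound, so the computed SI is a lower bound too.
si_lower_bound = selectivity_index(40.0, 1.25)
print(f"SRI58 SI > {si_lower_bound:g}")
```

This reproduces the SI > 32 reported for SRI58.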
Given their promising in vitro activity and cytotoxicity, SRI50 and SRI58 were profiled for kinetic solubility in pH 7.4 PBS, mouse liver microsomal stability, and Caco-2 cell permeability (Table 3). Due to its structural similarity to these two analogs, SRI54 was also tested in this panel. SRI50 and SRI58 were approximately eight-fold more soluble than SRI54. SRI58 exhibited significantly greater mouse liver microsomal stability than the two other analogs. With all three quinoxalines, metabolism appeared to be NADPH-dependent. All three compounds exhibited comparatively low Caco-2 cell permeability (Papp < 10 × 10⁻⁶ cm/s) in both directions, with efflux not being a significant issue (P(B-A)/P(A-B) < 3). Poor recovery of the compounds, due to low aqueous solubility and/or non-specific binding to the cell monolayer, may have affected the overall measurements of compound concentration on each side of the monolayer (especially those of SRI50, for which permeabilities could not be quantified). SRI58 (formulation: 10% DMA/90% (20% Solutol in citrate buffer pH 3.5)) did not exhibit quantifiable blood levels in mouse pharmacokinetic studies (iv and po; data not shown).
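The efflux criterion used in the Caco-2 assessment is the ratio of the B-A to A-B apparent permeabilities compared against a cutoff of 3. The sketch below illustrates that ratio. The function names are hypothetical, the A-B value is the SRI58 permeability quoted in the abstract, and the B-A value is an invented placeholder, not a measurement from Table 3.

```python
def efflux_ratio(p_app_b_to_a: float, p_app_a_to_b: float) -> float:
    """Return P(B-A) / P(A-B) for two apparent permeabilities in cm/s."""
    if p_app_a_to_b <= 0:
        raise ValueError("A-B permeability must be positive")
    return p_app_b_to_a / p_app_a_to_b


def efflux_is_significant(ratio: float, cutoff: float = 3.0) -> bool:
    """Flag efflux when the ratio reaches the cutoff used in the text."""
    return ratio >= cutoff


p_a_to_b = 2.3e-6  # cm/s, SRI58 A-B permeability from the abstract
p_b_to_a = 4.6e-6  # cm/s, HYPOTHETICAL B-A value for illustration only
ratio = efflux_ratio(p_b_to_a, p_a_to_b)
print(f"efflux ratio = {ratio:.1f}, significant = {efflux_is_significant(ratio)}")
```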

In vitro activity against MDR-TB
To avoid pursuing hits modulating biological targets pertinent to known antitubercular drugs, we tested the most potent quinoxaline di-N-oxide, SRI50, for activity against clinical MDR-TB strains with known drug resistance profiles [29]. SRI50 showed potent activity against clinical drug-susceptible as well as clinical MDR-TB strains, comparable to that against the laboratory H37Rv strain, suggesting a novel mechanism of action for this series (Table 4).

Mechanism of action studies through transcriptional profiling
To gain insight into the effect of these quinoxaline di-N-oxides on Mtb, we turned to transcriptional profiling [30,31]. In brief, Mtb grown on Middlebrook 7H9 supplemented with OADC, Tween 80, and glycerol was treated with SRI54 at 3.2 μg/mL (1.3X MIC) for 6 h in quadruplicate and mRNA was subsequently isolated. Microarray studies (fold changes in Mtb genes may be found in S3 Table) allowed determination of the effects of SRI54 on Mtb transcript levels as compared to a DMSO-only control. Overall, 131 genes were up-regulated at least twofold versus control and 184 genes were down-regulated at least twofold versus control. It is noteworthy that 3/13 genes (prpC, prpD, and icl1) in the methylcitrate cycle [32,33] were induced more than fourfold versus control. The other genes (10/13) in the cycle were not significantly affected by SRI54 as compared to control. In addition, 12/59 genes involved in DNA repair [34] were up-regulated and only 2/59 were down-regulated at least twofold as compared to the control. Two genes in leucine biosynthesis (leuC and leuD) were down-regulated at least twofold versus control samples. These two genes encode the two subunits of isopropylmalate dehydratase [35]. A number of genes involved in the FASII pathway [36] were similarly down-regulated, including kasA, kasB, and inhA. Finally, the top 100 most up-regulated and top 100 most down-regulated genes of Mtb exposed to SRI54, as compared to no-drug control, were compared with deposited Mtb transcriptional data (environmental stresses and small molecule antituberculars) [30] via hierarchical clustering (Fig 4). It is noteworthy that SRI54 clustered most closely to the fatty acid biosynthesis inhibitor CD117, which modulates both short-chain fatty acid and mycolic acid biosynthesis [37,38], and secondarily to two known protonophores, 2,4-dinitrophenol and carbonyl cyanide 3-chlorophenylhydrazone [30].
Table 3. Physicochemical and ADME data.
For microsomal stability, verapamil was used as a high-metabolism control (0.24% remaining with NADPH) and warfarin as a low-metabolism control (85% remaining with NADPH). The kinetic solubility limit was the highest concentration with no detectable precipitate. For Caco-2 cell permeability, compounds at a concentration of 10 μM were incubated for 2 h. Papp = apparent permeability coefficient. All compounds showed poor recovery due to either low solubility or non-specific binding. Ranitidine, warfarin and talinolol were used as low permeability, high permeability and P-gp efflux controls, respectively.
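The twofold thresholds applied to the microarray comparison (SRI54-treated versus DMSO-only control) amount to a simple classification of each gene's expression ratio. The following is a minimal sketch under stated assumptions: the function name, the gene labels and the ratios are hypothetical and are not values from S3 Table, and no statistical testing or replicate handling is modelled.

```python
def classify_fold_change(ratio: float, threshold: float = 2.0) -> str:
    """Label a treated/control expression ratio as up, down, or unchanged."""
    if ratio <= 0:
        raise ValueError("expression ratio must be positive")
    if ratio >= threshold:
        return "up"
    if ratio <= 1.0 / threshold:
        return "down"
    return "unchanged"


# HYPOTHETICAL ratios (treated / DMSO control); not values from S3 Table.
example_ratios = {"geneA": 4.6, "geneB": 0.45, "geneC": 1.2, "geneD": 2.0}
calls = {gene: classify_fold_change(r) for gene, r in example_ratios.items()}
counts = {
    label: sum(1 for call in calls.values() if call == label)
    for label in ("up", "down", "unchanged")
}
print(calls)
print(counts)
```

Counting the "up" and "down" labels over all genes would give totals analogous to the 131 and 184 genes reported in the text.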

Computational Target Prediction
It should be noted that the postulated targets for which the initial three hits were retrieved (Fig 1) are not represented in the TB Mobile Apps (version 1.0 or 2.0) used [16,17], although they are in similar property space as the respective training sets when visualized by PCA (S2 Fig). Therefore, the predictions may suggest additional targets, which could be followed up with biochemical and/or microbiological studies. Using TB Mobile version 1.0 for BAS 04912643, the closest hit is pyrazinoic acid (S3A Fig), which is predicted to be similar to compounds that target DeaD, Mfd, RecG, DinG, and NrdR. PCA clustering also places this compound in a predominantly FabH cluster (S4 Table). For BAS00623753, the closest hit targets the UDP-galactopyranose mutase Glf (S3B Fig), while PCA clustering places it in a MurB cluster. Finally, for BAS07571651, the closest hit targets Glf (S3C Fig) and PCA clustering places it in a QcrB cluster [39]. TB Mobile 2.0, with its larger database, use of ECFP_6 fingerprints for similarity analysis and addition of Bayesian models for targets, produced some differences in predictions. When utilizing similarity criteria for each of the three hits to infer target, the following hit-target pairings were predicted: BAS04912643-CysS, BAS00623753-Glf, and BAS07571651-InhA.

Discussion
Combining bioinformatics data from databases (like TBCyc, SRI's BioCyc collection [40,41], and Pathway Logic models [42][43][44][45]) with cheminformatics databases (like CDD) as well as computational modeling approaches is rarely attempted. When it is, synergies arise which can accelerate the process of drug discovery. For example, our essential metabolite approach using 3D pharmacophore model scoring of commercial chemical space, alone [6] or in combination with our Bayesian models for antitubercular whole-cell efficacy [7], may be viewed as intermediate between high-throughput screening and rational structure-based drug design. Our previous experiments with a multi-tiered, integrative informatics workflow (using pharmacophores, Bayesian model for whole cell activity and other filters for molecular properties) identified two acylthioureas suggested as mimics of D-fructose 1,6-bisphosphate which modestly inhibited the growth of Mtb, and have served as a starting point for further optimization [7]. Our approach in this study differed in that we used 3D pharmacophore models and consensus amongst three dual event Bayesian models [4,26,27] to select compounds for testing. It produced three hits with MIC less than or equal to 40 μg/mL. If we tighten the threshold of a hit and lower it to anything less than 10 μg/mL, the two most active retrieved out of 110 represents a hit rate much lower than those we have previously described using Bayesian methods alone [46,47]. This may be due to the far more stringent approach we have taken using multiple Bayesian models and pharmacophore models as well as other calculated properties.
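The hit rates implied by this paragraph follow directly from the counts it gives (110 compounds tested, three hits with MIC less than or equal to 40 μg/mL, two with MIC below 10 μg/mL). A short sketch of that arithmetic, with an illustrative function name, gives roughly 2.7% and 1.8%:

```python
def hit_rate_percent(n_hits: int, n_tested: int) -> float:
    """Return the hit rate as a percentage of compounds tested."""
    if n_tested <= 0:
        raise ValueError("number tested must be positive")
    return 100.0 * n_hits / n_tested


# Counts taken from the text: 110 tested, 3 hits at the looser cutoff,
# 2 hits at the tighter cutoff.
print(f"MIC <= 40 ug/mL: {hit_rate_percent(3, 110):.1f}%")
print(f"MIC <  10 ug/mL: {hit_rate_percent(2, 110):.1f}%")
```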
Additional recent computational approaches have been proposed to select antitubercular compounds such as the druggome approach, which uses structural information on Mtb targets [48], although others have suggested this still requires some refinement [49]. Pharmacophores for specific Mtb targets have been used recently for virtual screening for acetohydroxyacid synthase inhibitors as a prefilter to docking [50]. A second study developed a pharmacophore from crystal structures for InhA and used this alongside docking to identify inhibitors [51]. Each of these cases represents the standard approach of focusing on a single target and a single pharmacophore, while in the current study we have used 66 pharmacophores representing many targets in Mtb to potentially identify compounds of interest from a vendor library.
With our hybrid pharmacophore-Bayesian approach, the two most active hits were retrieved by the pharmacophore based on menadione. The pharmacophore consists of two hydrogen bond acceptors and a hydrophobic feature. The enzyme NuoD (Rv3148), NADH dehydrogenase I chain D, is a subunit of NADH dehydrogenase I. The reaction catalyzed by this enzyme uses the substrate menadione and involves a complex of 13 other subunits. After testing BAS04912643, we found that this compound, a quinoxaline 1,4-di-N-oxide, had been previously identified with an MIC of 3.13 μg/mL, possessed similar activity against resistant strains of Mtb, and had no appreciable mammalian cell cytotoxicity [52]. This earlier study had also shown that an analog of BAS04912643 was active in vivo. Further work by others has shown that quinoxaline-2-carboxylate 1,4-di-N-oxides have in vitro antitubercular activity and at least one compound was found to be active in vivo [53]. Similarly, the closely related benzotriazine di-N-oxides have also been shown to have activity in vitro [54].
These compounds are likely bioreductively activated and we were keen to further explore the quinoxaline 1,4-di-N-oxide core (Table 1). SRI50 was most potent (MIC = 0.32 μg/mL), with a Vero cell CC50 more than 10-fold higher than its MIC. This and other analogs (SRI54 and SRI58), featuring promising in vitro activity and relative lack of Vero cell cytotoxicity or interesting substitutions, were then profiled for in vitro ADME properties. Attempts to move SRI58 into in vivo infection studies in mice were halted with the failure to observe quantifiable levels of the compound in mice upon dosing iv or po. The iv result, in particular, suggests that rapid metabolism of SRI58 may be occurring that was not observed in the in vitro mouse liver microsomal stability studies.
Given the lack of mechanistic information as to the Mtb target/s of the quinoxaline di-N-oxides other than our demonstration of a lack of cross-resistance of SRI50 with front-line (isoniazid, rifampicin, ethambutol) and second-line (p-aminosalicylic acid, capreomycin, streptomycin, kanamycin) drugs, we have begun to probe their mechanism of action via transcriptional profiling. Interestingly, Mtb treated with an early compound of interest, SRI54, afforded a transcriptional response distinct from menadione, the Mtb metabolite our calculations suggested it may mimic in terms of 3D pharmacophore. The transcriptional data do, however, suggest a stress response of Mtb to SRI54 exposure. SRI54 treatment resulted in up-regulation of genes (prpC, prpD, and icl1) within the methylcitrate cycle, reminiscent of the response of Mtb to isoniazid, rifampicin, and streptomycin exposure reported by the Rhee laboratory [55]. Also potentially indicative of the Mtb response to SRI54 is the up-regulation of genes involved in DNA repair. While quinoxaline di-N-oxides have been reported to cleave DNA through their enzymatic reduction [56], it remains to be demonstrated whether this is the result of specific damage to Mtb DNA by SRI54 or a downstream consequence of the engagement of other target/s. Finally, hierarchical clustering of the transcriptional responses of Mtb to known antitubercular agents demonstrated a similarity of response to SRI54, CD117, 2,4-dinitrophenol, and carbonyl cyanide 3-chlorophenylhydrazone. It remains to be seen whether SRI54 and other quinoxaline di-N-oxides inhibit Mtb fatty acid biosynthesis as does CD117 [37] or disrupt the proton gradient of the transmembrane electrochemical potential like 2,4-dinitrophenol and carbonyl cyanide 3-chlorophenylhydrazone [30].
We have also applied our computational approach of using Bayesian models for targets to predict the possible targets for the hits retrieved using TB Mobile [11,16,17]. Since the TB Mobile database does not currently include NuoD, it may be unable to predict the assumed targets of the two most active hits correctly. It does, however, suggest additional potential targets. BAS 04912643 was predicted by the Bayesian models to inhibit FtsZ, CysH, DprE1 and Rv1885c. BAS00623753 was predicted to modulate DprE1, Rv1885, DprE2, CysH and Alr. When looking at each hit and its target according to the app, the suggested target for BAS04912643 was CysS, while that from BAS00623753 was Glf. This computational approach may help prioritize targets for further testing in future. Others have recently shown how multiple computational approaches can be successfully used to predict targets that were ultimately experimentally validated [57].
In summary, we have presented the utilization of a combined bioinformatics/cheminformatics platform to arrive at candidate inhibitors of essential Mtb enzymes, through mimicry of the substrate/s or product/s as judged by 3D pharmacophore fit, that are predicted by a consensus amongst Bayesian models to have whole-cell efficacy. Compounds that passed the selection criteria were then tested in vitro versus Mtb and hits were validated and then optimized. We have shown clearly this strategy can lead to in vitro active hits that are readily synthesized (quinoxaline di-N-oxides), one of which had been previously shown to be an analog of a compound with both in vitro and in vivo activity [52] and potentially worthy of further evaluation because of its cost of goods. This same approach could be applied to other neglected diseases such as malaria to identify compounds with activity and the potential target/s involved. In the process of this work we have developed a computational workflow that was initially dependent on manual operation. Using the API for CDD Vault we can now also enable the automation of the computational process such that the user can go from computational target selection for pharmacophore generation to identification of molecules from vendor libraries with pharmacophores.

Ethics Statement
Rutgers animal care and use committee approved this work, IACUC #12106A9.

Reagents and molecules
All experimental compounds for initial screening were purchased from Asinex (Winston-Salem, NC, USA) or synthesized in house. Purities were required to be greater than 90% with a majority of commercial compounds having a purity of greater than 95%. Compounds were all dissolved in dimethyl sulfoxide (Sigma Aldrich) at a stock concentration of 8.0 mg/mL immediately and then diluted for biological testing.
All reagents for chemical synthesis were purchased from commercial suppliers and used without further purification unless noted otherwise. All chemical reactions occurring solely in an organic solvent were carried out under an inert atmosphere of argon or nitrogen. Analytical TLC was performed with Merck silica gel 60 F254 plates. Silica gel column chromatography was conducted with Teledyne Isco CombiFlash Companion or Rf+ systems. 1H NMR spectra were acquired on Varian Inova 400, 500 and 600 MHz instruments and are listed in parts per million downfield from TMS. LC-MS was performed on an Agilent 1260 HPLC coupled to an Agilent 6120 MS. All synthesized compounds were at least 95% pure as judged by their HPLC trace at 250 nm and were characterized by the expected parent ion(s) in the MS trace. The Supplementary Materials include synthetic details pertinent to the arylamide and quinoxaline di-N-oxide series.

Identification, annotation and publication of new potential Mtb enzyme targets
We previously described [7] in detail 1) the identification of essential in vivo enzymes of Mtb, 2) the collection of metabolic pathway and reaction information for the essential enzymes, 3) the comparison of non-human-homologous enzymes with the Mtb in vivo essential gene set, and 4) the selection of Mtb targets that are essential in vivo but not homologous to human proteins and not known as TB drug targets. 25 in vivo essential enzymes (step 1) were noted in recent reports from the literature [58][59][60][61] and these include FbpC, DacB1, Cyp125, BioA, ArgJ, Nrp, SseA, End, BioF1, CobL, GcvT, AceE, HemN, AccD1, SerB2, AmiD, HsaF, Tal, FabG, NuoD, ProA, MalQ and ArcA. Among them, 2 enzymes have no human homologs (FbpC and DacB1). From a recent publication [62], we noted 32 Mtb enzymes with no human homologs and these are different from the 66 non-human homologs found previously [7]. Except for FbpC, these enzymes (31) are not in vivo essential. Among these 31 non-homologous proteins, 17 are metabolic choke points [63] and 18 have the highest number of interactions with pathogenesis-causing proteins. These in total give us 46 candidate essential enzyme targets, among which 16 have PDB structures with a ligand bound. We have listed all these targets and annotated them (S1 Table) with respect to gene details, pathway information, structural evidence, predicted essentiality, orthologs and inhibitor information. These are supported with links to relevant databases and PubMed references. This list is published within the CDD public database (https://app.collaborativedrug.com/register) to be explored by the scientific community.

In silico approaches for selecting molecules
For each molecule a 3D pharmacophore was developed using Discovery Studio 3.5 (Biovia, San Diego, CA) from 3D conformations of the substrate or metabolite. This identified key features, onto which was mapped a van der Waals surface for the molecule [3,6,64]. The pharmacophore plus shape was then used to search the Asinex Gold compound database (N = 205,997, for which up to 100 conformations per molecule were created with the FAST conformer generation method and a maximum energy threshold of 20 kcal/mol). The in silico hits were collated and uploaded to CDD, three previously described and validated dual event Bayesian models (MLSMR, CB2 and Kinase) [8][10][22][23][24][25] for Mtb whole-cell activity were used to score the compounds, and the data were re-imported into CDD. All of the molecules used to build the Bayesian models are available as freely accessible datasets at www.collaborativedrug.com and Figshare [9,23,24]. Finally, the compounds were filtered based on pharmacophore fit values > 2.5 and Bayesian scores that predicted whole-cell activity as "true". A compound had to satisfy all of these criteria to be selected. Through this process 141 molecules were retrieved, which were further narrowed to 110 molecules based on the opinion of an experienced medicinal chemist, removing compounds with reactive/unstable chemical functionality [28]. These compounds were then purchased for testing.
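The automated part of this selection (pharmacophore fit above 2.5 and an "active" call from all three dual event Bayesian models) can be sketched as a simple filter. This is a minimal illustration only: the record layout, function names, compound identifiers and scores are hypothetical, it does not call Discovery Studio or CDD, and the final medicinal chemistry review is not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List

# Model names as given in the text.
MODEL_NAMES = ("MLSMR", "CB2", "Kinase")


@dataclass
class ScoredCompound:
    compound_id: str                   # hypothetical identifier
    pharmacophore_fit: float           # fit value; higher is better
    predicted_active: Dict[str, bool]  # one call per Bayesian model


def passes_filters(compound: ScoredCompound, min_fit: float = 2.5) -> bool:
    """Require fit > min_fit and an active call from every model."""
    all_active = all(
        compound.predicted_active.get(name, False) for name in MODEL_NAMES
    )
    return compound.pharmacophore_fit > min_fit and all_active


def select_candidates(compounds: List[ScoredCompound]) -> List[str]:
    """Return identifiers of compounds that satisfy all criteria."""
    return [c.compound_id for c in compounds if passes_filters(c)]


# HYPOTHETICAL scored library for illustration only.
library = [
    ScoredCompound("CMPD-1", 2.9, {"MLSMR": True, "CB2": True, "Kinase": True}),
    ScoredCompound("CMPD-2", 3.1, {"MLSMR": True, "CB2": False, "Kinase": True}),
    ScoredCompound("CMPD-3", 2.5, {"MLSMR": True, "CB2": True, "Kinase": True}),
]
print(select_candidates(library))
```

Note that the cutoff is strict, so a fit value of exactly 2.5 is excluded, matching the "> 2.5" wording of the text.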

CDD TB DB
The development of the CDD database has been described previously with applications for collaborative malaria [65] and TB research [3,4]. The literature data on Mtb drug discovery has been curated and over 20 Mtb-specific datasets are hosted, representing well over 300,000 compounds derived from patents, literature and high throughput screening (HTS) data, and we have termed this the CDD TB DB. Some of these datasets were used to develop the Bayesian models used in this study [8]. The data generated in this study were saved in a secure CDD Vault for collaborators to share.

CDD application programming interface development
We have described a complex workflow between different computational databases such as TBCyc and CDD Vault, as well as computational model development with Discovery Studio. To facilitate connectivity between these software packages, an application programming interface (API) was developed which allows this connectivity between software tools. The goal of this was to provide a user interface for curating TB drug targets and molecules to fully exploit published literature and data created in this project. This also integrates database searching with computational modeling tools by defining data exchange formats that enable both interactive and fully automated modeling, database searching, hit scoring and compound selection for purchasing. Ultimately this extends data types and computational modeling software capabilities upstream to target identification and validation capabilities.
The current version of TBCyc (MTH37RVV) was extended by adding drug candidates (in mol2 format) to the database via a small script of custom LISP using the Pathway Tools' API. Associated gene/proteins were added as regulated entities of the added compounds and linked externally to a CDD Vault via the 'dbdef' and 'linkdef' command line options for linking PGDB entries with external URLs.

Measurement of Antibacterial Activity Against Mtb
We used the resazurin (Alamar Blue) assay as the primary screen for activity against replicating Mtb [66]. Each compound was tested over a range of concentrations to determine the MIC. The antimicrobial susceptibility test was performed in a clear-bottomed, round well, 96-well microplate. Initial compounds were tested at 8 concentrations ranging between 40 and 0.31 μg/mL with a final DMSO concentration of 1.25% in each well. After a growth medium containing 10⁴ bacteria was added to each well, the different dilutions of compounds were added. Controls included (1) wells containing rifampin and isoniazid at concentrations ranging from 0.00039 to 8.0 μg/mL to control for assay performance, (2) wells with bacteria, growth medium, and vehicle (1.25% DMSO), and (3) sterility control wells with medium. Plates were incubated at 37°C for 6 d in an ambient incubator, at which time 5 μL of 1% resazurin dye was added to each well. After 2 d of incubation, visual inspection of color (pink, periwinkle or blue) was recorded for each well along with measurements of fluorescence in a microplate fluorimeter with excitation at 530 nm and emission at 590 nm. The lowest drug concentration that inhibited growth of 90% of Mtb bacilli in the broth was considered the MIC value [67]. Rifampicin (MIC range 0.0031-0.012 μg/mL) and isoniazid (MIC range 0.0031-0.012 μg/mL) were used as positive controls and were consistently in the acceptable range. The MIC against MDR strains was also tested using the AlamarBlue Cell Viability Reagent (DAL1100, ThermoFisher Scientific) as described above, except the MICs were read using absorbance as per the manufacturer's recommendation.
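The MIC readout described above can be sketched in a few lines. The eight concentrations between 40 and 0.31 μg/mL are consistent with a two-fold dilution series (40/2⁷ ≈ 0.31). The normalization of fluorescence to the vehicle and sterility controls is an assumption made here for illustration, since the text does not give the exact calculation; the function names and all readings are hypothetical.

```python
from typing import List, Optional


def two_fold_series(top: float, n_points: int) -> List[float]:
    """Return n_points concentrations, halving from the top concentration."""
    return [top / (2 ** i) for i in range(n_points)]


def percent_inhibition(signal: float, growth_ctrl: float, sterile_ctrl: float) -> float:
    """Assumed normalization: 0% = vehicle growth control, 100% = sterile medium."""
    return 100.0 * (1.0 - (signal - sterile_ctrl) / (growth_ctrl - sterile_ctrl))


def mic_from_readings(concs: List[float], signals: List[float],
                      growth_ctrl: float, sterile_ctrl: float,
                      cutoff: float = 90.0) -> Optional[float]:
    """Lowest concentration giving >= cutoff % inhibition, scanning from the top."""
    mic = None
    for conc, signal in sorted(zip(concs, signals), reverse=True):
        if percent_inhibition(signal, growth_ctrl, sterile_ctrl) >= cutoff:
            mic = conc
        else:
            break  # stop at the first concentration that fails the cutoff
    return mic


concs = two_fold_series(40.0, 8)  # 40, 20, 10, 5, 2.5, 1.25, 0.625, 0.3125
# HYPOTHETICAL fluorescence readings, one per concentration (highest first).
signals = [520.0, 540.0, 600.0, 900.0, 1300.0, 6000.0, 9000.0, 9800.0]
print(mic_from_readings(concs, signals, growth_ctrl=10000.0, sterile_ctrl=500.0))
```

If no concentration reaches the cutoff, the function returns None, which corresponds to reporting the MIC as greater than the top concentration tested.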

Cytotoxicity determination
Vero cells (CCL-81, ATCC) were plated in 96-well plates (~5 × 10⁴ cells/well) and incubated overnight in cell culture media (MEM + 5% FBS + 1% Pen/strep + 1% L-Glutamine). Test compounds were added to the cells from stock solutions at final concentrations of 0.5-50 μM, with a final DMSO concentration of 0.645%, and incubated for 72 h at 37°C with 5% CO2. At the end of this incubation period, cell viability was measured using a CellTiter-Glo Luminescent cell viability assay (Promega) according to the manufacturer's instructions. Treatment with 5% DMSO was used as a control for maximal cytotoxicity and 0.645% DMSO as a negative control. CC50 values were derived by plotting the calculated percent viability as a function of compound concentration and fitting the results to a four-parameter logistic function in GraphPad Prism.
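The fit itself was performed in GraphPad Prism; the sketch below only illustrates the general form of a four-parameter logistic curve and how percent viability might be computed from the two DMSO controls. The parameter names (`bottom`, `top`, `cc50`, `hill`) and the normalization are assumptions, and the exact parameterization used in Prism is not stated in the text.

```python
def percent_viability(signal, vehicle_ctrl, max_tox_ctrl):
    """Scale luminescence so vehicle (0.645% DMSO) = 100% and 5% DMSO = 0%."""
    return 100.0 * (signal - max_tox_ctrl) / (vehicle_ctrl - max_tox_ctrl)


def four_pl(conc, bottom, top, cc50, hill):
    """Four-parameter logistic: viability falls from `top` to `bottom` as conc rises."""
    return bottom + (top - bottom) / (1.0 + (conc / cc50) ** hill)


def conc_at_viability(target, bottom, top, cc50, hill):
    """Invert the curve: concentration at which viability equals `target`."""
    return cc50 * ((top - bottom) / (target - bottom) - 1.0) ** (1.0 / hill)


# Hypothetical fitted parameters, for illustration only
params = dict(bottom=0.0, top=100.0, cc50=10.0, hill=1.5)
print(four_pl(10.0, **params))            # viability at the curve midpoint
print(conc_at_viability(50.0, **params))  # concentration giving 50% viability
```

When `bottom` is 0 and `top` is 100, the curve midpoint and the concentration giving 50% viability coincide; otherwise the two differ, and it matters which one is reported as the CC50.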

ADME/Tox Screening
With a considerable percentage of drug failures attributed to ADME/Tox (Absorption, Distribution, Metabolism, Excretion and Toxicity) issues [68,69], it is important to assess these qualities early in the drug development process. Kinetic solubility, metabolic stability, and Caco-2 permeability were evaluated by Cyprotex (Watertown, MA).
Kinetic solubility. Serial dilutions of the test agent were prepared in DMSO at 100x the final concentration. Test article solutions were diluted 100-fold into pH 7.4 phosphate-buffered saline (PBS) in a 96-well plate and mixed. After 2 h at 37°C, the presence of precipitate was detected by turbidity (absorbance at 540 nm). An absorbance value greater than 'mean + 3x standard deviation of the blank' (after subtracting the background) was indicative of turbidity. For brightly colored compounds, a visual inspection of the plate was performed to verify the solubility limit determined by absorbance. The solubility limit was reported as the highest experimental concentration with no evidence of turbidity.
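The turbidity rule above can be sketched as follows. This is an illustration under stated assumptions rather than the vendor's procedure: the sample standard deviation is used, the solubility limit is read as the highest concentration below the first turbid well, and all names and absorbance values are hypothetical.

```python
from statistics import mean, stdev


def turbidity_threshold(blank_abs):
    """Mean + 3 x SD of the blank wells (sample SD is an assumption)."""
    return mean(blank_abs) + 3 * stdev(blank_abs)


def solubility_limit(readings, blank_abs):
    """Highest concentration below the first turbid well.

    readings: (concentration_uM, background-subtracted A540) pairs.
    Returns None if even the lowest concentration is turbid.
    """
    threshold = turbidity_threshold(blank_abs)
    limit = None
    for conc, absorbance in sorted(readings):
        if absorbance > threshold:
            break  # first sign of precipitate; stop scanning upward
        limit = conc
    return limit


# Hypothetical plate data, for illustration only
blanks = [0.010, 0.012, 0.011, 0.009]
readings = [(25, 0.010), (50, 0.012), (100, 0.013), (200, 0.050), (400, 0.120)]
print(solubility_limit(readings, blanks))
```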
Metabolic stability assays. The test agent was incubated in duplicate with mouse liver microsomes at 37°C. The reaction contained microsomal protein in 100 mM K3PO4, 2 mM NADPH, and 3 mM MgCl2 at pH 7.4. A control omitting NADPH was run for each test agent to detect NADPH-free degradation. At t = 0 and 60 min, an aliquot was removed from each experimental and control reaction and mixed with an equal volume of ice-cold Stop Solution (methanol containing propranolol as an internal standard). Stopped reactions were incubated for at least 10 min at -20°C, and an additional volume of water was added. The samples were centrifuged to remove precipitated protein, and the supernatants were analyzed by LC/MS/MS to quantitate the remaining parent compound. Data were reported as % remaining, calculated by dividing the concentration at 60 min by the time-zero concentration.
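A minimal sketch of the % remaining calculation for duplicate incubations is given below. Averaging the two replicates is an assumption (the text does not say how duplicates were combined), and the function names and example values are hypothetical.

```python
def percent_remaining(conc_t, conc_0):
    """Parent compound remaining at time t as a percentage of time zero."""
    return 100.0 * conc_t / conc_0


def mean_percent_remaining(replicates):
    """Average % remaining over (time-zero, 60 min) concentration pairs."""
    values = [percent_remaining(t60, t0) for t0, t60 in replicates]
    return sum(values) / len(values)


# Hypothetical duplicate incubations: (concentration at 0 min, at 60 min)
with_nadph = [(1.00, 0.60), (1.00, 0.70)]
without_nadph = [(1.00, 0.98), (1.00, 0.96)]
print(mean_percent_remaining(with_nadph))
print(mean_percent_remaining(without_nadph))
```

Comparing the two values separates NADPH-dependent metabolism from other losses, which is the purpose of the NADPH-free control.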
Intestinal permeability assays. Caco-2 cells grown in tissue culture flasks were trypsinized and suspended in medium, and the suspensions were applied to wells of a Millipore 96-well Caco-2 plate. The cells were allowed to grow and differentiate for three weeks, with feeding at 2 d intervals. For Apical to Basolateral (A→B) permeability, the test agent was added to the apical (A) side and the amount of permeation was determined on the basolateral (B) side; for Basolateral to Apical (B→A) permeability, the test agent was added to the B side and the amount of permeation was quantified on the A side. The A-side buffer contained 100 μM Lucifer Yellow dye in Transport Buffer (1.98 g/L glucose in 10 mM HEPES, 1x Hank's Balanced Salt Solution) at pH 6.5, and the B-side buffer was Transport Buffer at pH 7.4. Caco-2 cells were incubated with these buffers for 2 h, and the receiver-side buffer was removed for analysis by LC/MS/MS (with propranolol used as an internal standard). To verify that the Caco-2 cell monolayers were properly formed, aliquots of the cell buffers were analyzed by fluorescence to determine the transport of the impermeable dye Lucifer Yellow. Any deviations from control values were reported.
Data were expressed as permeability ($P_{\mathrm{app}}$):

$$P_{\mathrm{app}} = \frac{dQ/dt}{C_0 \, A}$$

where $dQ/dt$ was the rate of permeation, $C_0$ was the initial concentration of test agent, and $A$ was the area of the monolayer.
In bidirectional permeability studies, the Efflux Ratio ($R_e$) is also calculated:

$$R_e = \frac{P_{\mathrm{app}}(B \rightarrow A)}{P_{\mathrm{app}}(A \rightarrow B)}$$

An $R_e > 3$ indicates a potential substrate for P-glycoprotein or other active transporters.
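As a worked illustration, the sketch below computes permeability as the permeation rate divided by the product of initial concentration and monolayer area (the standard definition, assumed here), then the efflux ratio and the R_e > 3 flag. The function names and numerical inputs are hypothetical; units must be consistent (for example, nmol/s, nmol/cm³ and cm² give cm/s).

```python
def p_app(dq_dt, c0, area):
    """Permeability: permeation rate / (initial concentration x area).

    With dq_dt in nmol/s, c0 in nmol/cm^3 (numerically equal to uM)
    and area in cm^2, the result is in cm/s.
    """
    return dq_dt / (c0 * area)


def efflux_ratio(papp_b_to_a, papp_a_to_b):
    """Ratio of basolateral-to-apical over apical-to-basolateral permeability."""
    return papp_b_to_a / papp_a_to_b


def is_potential_efflux_substrate(ratio, cutoff=3.0):
    """Flag compounds whose efflux ratio exceeds the cutoff stated in the text."""
    return ratio > cutoff


# Hypothetical inputs, for illustration only
a_to_b = p_app(dq_dt=2.2e-6, c0=10.0, area=0.11)  # about 2e-6 cm/s
b_to_a = p_app(dq_dt=8.8e-6, c0=10.0, area=0.11)  # about 8e-6 cm/s
ratio = efflux_ratio(b_to_a, a_to_b)
print(a_to_b, b_to_a, ratio, is_potential_efflux_substrate(ratio))
```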

Target predictions
Over 700 compounds with known Mtb targets were initially collated from the literature [7] and made available in the mobile application TB Mobile (Collaborative Drug Discovery Inc., Burlingame, CA), which is freely available for iOS and Android platforms [17,72]. This dataset was recently updated in TB Mobile 2.0 to 805 compounds covering 96 targets [16]. Molecules representing hits from this study were input as queries in TB Mobile versions 1 and 2.0, and the similarity of all molecules was calculated in the application. A Principal Component Analysis (Discovery Studio) was also performed with all the molecules in version 1. In both versions 1 and 2.0 of the app, the most structurally similar compounds (compounds are ranked with the most similar first, as the Tanimoto similarity is not specified in the app) were used to infer Mtb targets. Bayesian models integrated in the version 2.0 app were also used to predict targets. Clustering of the molecules with TB Mobile compounds was also undertaken in Discovery Studio (S4 Table).

Supporting Information
S1 Data. Compounds synthesized and described in Tables 1 and 2.