Skip to main content
  • Loading metrics

Using Multiple Microenvironments to Find Similar Ligand-Binding Sites: Application to Kinase Inhibitor Binding

  • Tianyun Liu,

    Affiliation Department of Genetics, Stanford University, Stanford, California, United States of America

  • Russ B. Altman

    Affiliations Department of Genetics, Stanford University, Stanford, California, United States of America, Department of Bioengineering, Stanford University, Stanford, California, United States of America


The recognition of cryptic small-molecular binding sites in protein structures is important for understanding off-target side effects and for recognizing potential new indications for existing drugs. Current methods focus on the geometry and detailed chemical interactions within putative binding pockets, but may not recognize distant similarities where dynamics or modified interactions allow one ligand to bind apparently divergent binding pockets. In this paper, we introduce an algorithm that seeks similar microenvironments within two binding sites, and assesses overall binding site similarity by the presence of multiple shared microenvironments. The method has relatively weak geometric requirements (to allow for conformational change or dynamics in both the ligand and the pocket) and uses multiple biophysical and biochemical measures to characterize the microenvironments (to allow for diverse modes of ligand binding). We term the algorithm PocketFEATURE, since it focuses on pockets using the FEATURE system for characterizing microenvironments. We validate PocketFEATURE first by showing that it can better discriminate sites that bind similar ligands from those that do not, and by showing that we can recognize FAD-binding sites on a proteome scale with Area Under the Curve (AUC) of 92%. We then apply PocketFEATURE to evolutionarily distant kinases, for which the method recognizes several proven distant relationships, and predicts unexpected shared ligand binding. Using experimental data from ChEMBL and Ambit, we show that at high significance level, 40 kinase pairs are predicted to share ligands. Some of these pairs offer new opportunities for inhibiting two proteins in a single pathway.

Author Summary

Small molecule drugs may interact with many proteins. Some of these interactions may cause unexpected effects, including side effects or potentially useful therapeutic effects. One way to predict these effects is to analyze the three-dimensional structure of target proteins, and identify new binding sites for small molecule drugs. Several methods have been proposed for predicting new binding sites, relying on geometric and functional complementarity of the sites and the small molecules. In this paper, we report on a new method for identifying novel protein-drug interactions by analyzing the similarity between binding sites in proteins. The method has relatively weak geometric requirements and allows for conformational change or dynamics in both the ligand and protein. Our results show that geometric flexibility is useful for effectively comparing sites. We have applied the method to evolutionarily distant kinases, and find unexpected shared inhibitor binding. Our results may be valuable for drug repurposing in order to find novel uses for existing kinase inhibitors.


Structural biology studies have provided large numbers of high-resolution proteins, often bound to small molecule ligands. The ability to predict additional ligands that will bind these proteins is an exciting opportunity for understanding drug action and repurposing. In some cases, the binding of a small molecule to a protein may explain otherwise unexpected effects of the small molecule, such as side effects of drugs. In other cases, the binding of a small molecule may suggest new uses of existing drugs, based on unexpected affinity to new targets.

Previous methods for predicting the potential binding of small molecules to protein pockets have used evolutionary, structural, biochemical and geometric properties in order to assess pocket similarity, or ligand-pocket complementarity [1], [2], [3], [4], [5]. For example, the method of sequence order-independent profile-profile alignment (SOIPPA) [6] can recognize binding site similarity between the cholesteryl ester transfer protein (CETP) and off-targets, including retinoid X receptor and peroxisome proliferator-activated receptors (PPARs). These new targets may explain adverse drug effects of CETP inhibitors [7]. SOIPPA represents binding sites with a tessellation of C-alpha atoms and characterizes binding sites using geometric similarity potentials. SOIPPA evaluates 3D alignments between binding sites that are enriched for similar angles and distances between residues. It then gauges overall similarity based on geometric criteria, evolutionary and biochemical properties.

Like SOIPPA, other methods for locally comparing binding sites typically have three steps [5]: (1) representation of binding sites, (2) 3D alignment between two sites and (3) evaluation of a similarity metric to the two sites. Searching for the best 3D alignment is the essential step. There are geometric hashing methods (SiteEngine [8] and SiteBase [1]) and methods based on clique detection (SOIPPA [6], CavBase [9] and eF-site [10]). These methods use thresholds to control the similarity of local geometries in both types of methods, but these can be difficult to set. In particular, flexible matching can be critical in achieving high performance [11]. Thornton et al showed that binding sites with similar ligands display greater conformational variability than the corresponding ligand molecules [12]. Thus, using predefined geometric models and thresholds is not optimal. Excessive reliance on crystallographic poses for both the protein and the ligand can miss potential similarities.

We have previously described the FEATURE methods for describing active sites [13]. In the FEATURE representation, a protein site is represented by one or more microenvironments—statistical descriptions of the occurrence of atoms, residues, secondary structures as well as biochemical and biophysical properties in radial shells around a central point [14]. We have shown that the FEATURE representation is useful for describing sites for Ca++ binding [14], Mg++ binding [15], serine protease active sites [13], thioredoxin active sites [16] and others [17]. We have also used it to evaluate the ability of engineered loops to bind ligands [18]. In this work, we reasoned that the interactions between microenvironments in the target protein and chemical fragments in the ligand may drive molecular recognition. Because the conformational arrangement of fragments within flexible ligand molecules can be very different, we allow the corresponding microenvironments to adopt different relative geometries. We develop an algorithm, PocketFEATURE, to match microenvironments within pockets in order to find pockets with potentially similar binding capabilities.

We validate PocketFEATURE by testing its performance on two tasks. First, we test its ability to detect the similarity of binding sites that are known to bind the same (containing adenine-ribose) ligands–a test of sensitivity. Second, we test its ability to detect FAD binding across a large proteome-scale set of proteins–a test of specificity. In both tests, the method shows very strong performance, including an ability to detect similarities missed by other methods.

Kinases play an important role in cell signaling, and can be dysregulated in cancer. Several recently introduced cancer drugs act as ATP analogues and inhibit kinase action [19],[20]. Therefore, the binding capabilities (or binding profiles) of kinases may be useful for discovering novel kinase inhibitors. In particular, kinase inhibitors that bind two or more kinases in the same pathway are attractive because they can more effectively interfere with the pathway, without the need for very high doses or multiple drugs [21], [22]. Kinase binding profiles may also be useful for understanding side effects of drugs that bind multiple proteins [22]. In principle, the ability of an inhibitor to bind multiple kinases may occur despite very low sequence and structure homology. For these reasons, the ability to detect similar binding sites in divergent kinases is potentially valuable.

There are sixteen cancer drugs approved or in advanced development that are known to have multiple targets [22]. All of these drugs target members of the same protein family that regulate the same signaling process. Accordingly, most studies have focused on dissecting the detailed binding preferences of drugs on a relatively small set of kinases that are known to be important. In fact, multi-target drugs that work on proteins within distant families are only rarely reported [23]. Therefore, one of the goals of this work is to discover unrecognized binding similarities between remotely related proteins to increase the repertoire of kinase inhibitor action and utility. Accordingly, we apply PocketFEATURE to predict the similarity in inhibitor-binding profiles between kinases. In particular, we seek similar inhibitor-binding profiles for kinases that are evolutionarily distant.


Detecting binding site similarity

We first validated PocketFEATURE's ability to detect binding site similarity. In particular, we tested its ability to recognize pockets that bind similar ligands, and distinguish them from pockets that do not bind these ligands. The benchmark dataset is provided by Bourne group from USCD [6]. The ligands sharing similarity are: ATP, ADP, NAD, FAD, SAH and SAM, all of which contain an adenine-ribose moiety. We compared the performance of PocketFEATURE to SOIPPA, which outperforms other ligand-binding site comparison algorithms in this task [6]. Figure 1 shows PocketFEATURE's ability to recognize 30381 pairs of sites that both bind adenine-ribose moieties, and to recognize the lack of similarity of 24947 control pairs of sites in which one site binds adenine-ribose and the other does not. The AUC for the entire benchmark is 0.85. At specificities of 95% to 99.5%, PocketFEATURE outperforms SOIPPA. At 95% specificity, PocketFEATURE identifies about 40% similar pairs, while SOIPPA identifies less than 30%. Our results demonstrate that PocketFEATURE can identify binding sites with overlapping chemical specificity. Proteins in this test are evolutionarily divergent (<5% of them are from the same SCOP superfamily). Thus, PocketFEATURE can detect site similarity across remote evolutionary relationships.

Figure 1. Comparison of the performance of PocketFEATURE to SOIPPA.

The benchmark measures the ability to discriminate 30381 pairs of sites that both bind adenine-ribose moieties from 24947 control pairs of sites where one site binds adenine-ribose and the other does not. The AUC for the entire benchmark is 0.85. At the specificity range of 95% to 99.5%, PocketFEATURE outperforms SOIPPA. At 95% specificity, PocketFEATURE identifies about 40% positive pairs, while SOIPPA identifies less than 30%.

Figure 2 shows an ATP-binding site (1kvk), an NAD-binding site (1a5z) and an FAD-binding site (2b9w). There are five corresponding microenvironments in each protein (spheres are shown in Figure 2A and the types of microenvironments are listed in Figure 2B). These five microenviroments are in proximity to the adenine moiety from ATP, NAD and FAD molecules. It is important to note that the five microenvironments adopt different relative geometries in these three sample sites while coordinating the ligands. We consider the five microenvironments to constitute modules capable of recognizing adenine-ribose moieties contained within diverse ligand molecules.

Figure 2. An example illustrating how PocketFEATURE identifies similar sites that bind to ligands with overlapping chemical specificity.

We compare an ATP-binding site (1kvk/ATP), an NAD-binding site (1a5z/NAD) and an FAD-binding site (2b9w/FAD). There are five sets of mutual aligned microenvironments between the three sites. (A) 3D structures of the binding site. The five sets of mutual aligned microenvironments are represented as colored spheres: red for microenvironment centered at residue type E(mE) or D (mD), blue for mT, green for mY, orange for mV or mI and purple for mP. The five microenviroments are close to the adenine moiety from ATP, FAD and NAD molecules. The aligned microenvironments display different relative geometries in the three sites. 3D illustrations were generated using PyMOL [37]. (B) Index of center residues for the mutual aligned microenvironments. (C) Significance of alignments. Tc scores between the five sets of mutual aligned microenvironments are mapped to the pre-calculated PDF for a given type of microenvironment pair. The scores between ATP-binding and NAD-binding sites are marked using diamonds, those between ATP-binding and FAD-binding using circles, and those between NAD-binding and FAD-binding using squares. For each of the five sets, the S(Tc) scores fall within the p-value cutoff of 0.05.

The S(Tc) score captures the similarity of a pair of FEATURE microenvironments. To evaluate the significance of alignments between microenvironments, we calculated the probability distribution function (PDF) of microenvironment similarity score S(Tc) using a non-redundant dataset of 3D structures in PDB. Given each set of the mutual alignment between ATP, NAD and FAD binding sites, Figure 2C maps the three S(Tc) scores to the corresponding microenvironment types. The alignments of high significance form the basis for recognizing similarity between binding sites detected by PocketFEATURE.

Detecting FAD-binding sites in a large druggable database

The previous test demonstrated that PocketFEATURE can sensitively detect similarities between adenine-ribose containing pockets. We tested PocketFEATURE's ability to specifically recognize FAD-binding sites on a structural proteome scale. The Dataset-6958 is derived from an annotated database of “druggable binding sites” from PDB called scPDB (see Method section). It includes a total of 6709 non-FAD-binding proteins and 249 FAD-binding proteins, from 43 EC-families. Using a single arbitrarily selected FAD-binding site (1nhp/FAD) as a query structure, we searched the database for all other FAD-binding sites from Dataset-6958 by pairwise comparison. Figure 3A shows the overall performance with an AUC value of 0.92. At 95% specificity, PocketFEATURE identifies about 65% sites that are known to bind FAD. At 80% specificity, PocketFEATURE identifies nearly 90% sites correctly.

Figure 3. Identification of FAD-binding sites in a large druggable database.

(A) The overall performance with an AUC value of 0.92 At 95% specificity, PocketFEATURE identified about 65% sites that are known to bind FAD. At 80% specificity, PocketFEATURE identified nearly 90% sites correctly. (B) Microenvironment alignments between two FAD-binding sites. Two sites adapt to two different ligand conformations, an elongated (1nhp) and a bent butterfly conformation (1fdr). The relative geometries arrangements of aligned microenvironments are different. (C) An illustration of the corresponding positioning of aligned microenvironments and FAD molecules. The aligned microenvironments m6L/1nhp-m67L/1fdr (red), m250W/1nhp-m248W/1fdr (orange) and m79I/1nhp-m68V/1fdr (yellow) are near the adenine moiety of FAD molecules in 1nhp and 1fdr, respectively. Near phosphate chemical groups, aligned m132R/1nhp-m77P/1fdr (pink) and m12G/1nhp-m77P/1fdr (green) are observed. Another two sets of microenvironments m159Y/1nhp-m52Y/1fdr (cyan) and m300T/1nhp-m53S/1fdr (wheat) are found close to flavin groups.

Of note, FAD binds proteins in two general conformations [24]: the (1) elongated and the (2) bent butterfly conformation (Figure 3B). In the bent conformation (1nhp in Figure 3B), the AMP portion is folded back, placing the adenine and isoalloxazine rings in close proximity, whereas in the elongated conformation (1fdr in Figure 3B) the adenine ring is distant from the isoalloxazine ring. Remarkably, PocketFEATURE can detect bent butterfly FAD sites based on an elongated FAD query structure. Conversely, using a binding site with a bent butterfly FAD as a query structure, PocketFEATURE can detect elongated FAD binding sties (Supplementary Material Figure S1).

Figure 3B and 3C show the microenvironment alignments detected by PocketFEATURE. (See Method section, A microenvironment is named using the following convention: “m” followed by “residue index” and “residue type”, upon which the microenvironment is centered. PDB identifier or gene name is tagged when necessary.) The aligned microenvironments m6L1nhp-m67L/1fdr (red), m250W/1nhp-m248W/1fdr (orange) and m79I/1nhp-m68V/1fdr (yellow) are near the adenine moiety of FAD in 1nhp and 1fdr, respectively. Aligned microenvironments m132R/1nhp-m77P/1fdr (pink) and m12G/1nhp-m77P/1fdr (green) are adjacent to phosphate chemical groups. Another two sets of microenvironments m159Y/1nhp-m52Y/1fdr (cyan) and m300T/1nhp-m53S/1fdr (wheat) are found near flavin groups. As might be expected, the different conformations of FAD molecules in the structures 1nhp and 1fdr lead to markedly different geometric arrangements of these aligned microenvironments. However, PocketFEATURE identifies these alignments with high significance, demonstrating the robustness of the algorithm.

Other related benchmarks

We have performed three independent experiments to test PocketFEATURE's ability to specifically recognize non-adenine ligand binding sites. First, using a typical steroid-binding site (1A28 bound with progesterone) to detect all other steroid-binding sites (total 83 sites) from Dataset-6985 (Supplementary Material Figure S2). The overall AUC is 0.826. Second, we compared PocketFEATURE to a 3D shape descriptor using real spherical harmonic expansion coefficients [25], [26]. Using their published datasets (10 sites for ATP, 10 for NAD, 10 for heme and 10 for steroids), PocketFEATURE successfully clusters the four types of sites and compares favorably with the real spherical harmonic shape descriptor (Supplementary Material Figure S3). Third, we applied PocketFEATURE to predicted off-targets for Torcetrapib from a non-redundant subset of PDB for 1200 human proteins. Supplementary material Table S1 lists top ranked off-targets predictions. Xie et al [7] published a panel of 20 off-targets for CETP inhibitors (specifically Torcetrapib) predicted by SOIPPA. These predictions have been refined and validated by docking methods and critical human curation. Comparing the 20 published off-targets by SOIPPA and the 36 predictions by PocketFEATURE, we find that seven are the same. These three tests show that PocketFEATURE effectively recognizes non-adenine ligand binding sites.

The Similarity Ensemble Approach (SEA) is a method for calculating chemoinformatics similarities between drug sets [27]. Given two binding sites, we incorporate the similarity scores between experimentally observed ligand molecules calculated by SEA into the PocketFEATURE similarity score between the binding sites. Preliminary results suggest that combining ligand chemoinformatics and target physiochemical properties leads to strong signals for similarity detection (Supplementary Material Figure S4).

Predicting inhibitor-binding for distantly related kinases

Having established reasonable sensitivity and specificity of PocketFEATURE, we studied ATP-binding sites in kinases, to find new targets for these inhibitors. We specifically targeted kinase proteins that are distantly related, with little detectable sequence or structural similarity, because most previous work has focused on the binding properties of closely related kinases.

We calculated the binding site similarity scores of the 6058 pairs sites in distantly related kinases using PocketFEATURE. Figure 4 plots the PDF of the scores for the 6058 pairs. A small p-value suggests an increased likelihood of high binding site similarity for the pair of proteins evaluated.

Figure 4. Predicting overlapping of inhibitor-binding profiles between kinases.

The binding site similarity scores of 6058 pairs of distant kinases were fitted into normal distribution. The more negative the score, the higher level of similarity between two sites is predicted. The blue dotted line is the p-value cutoff of 0.01 for highly ranked predictions (Table 1). Of the 6058 pairs, there are 40 pairs of which experimental data are available to both kinases. Out of the 40 pairs, 11 pairs have experimental results suggesting overlap in inhibitor binding – they are “positive-pairs”. The remaining 29 pairs are “negative-pairs” that do not share ligands. The significance of the separation of positive-pairs and negative-pairs in this ranking, as evaluated by a hyper-geometric distribution has a p-value of 4.5e-18.

CHEMBL is a manually curated chemical database of bioactive molecules with drug-like properties [28]. Our local database downloaded from CHEMBL includes binding assays between a total of 3199 protein domains and 541,137 compounds. Similarly, the Ambit panel provides high-throughput kinase selectivity profiling of 317 kinases against 37 known kinase inhibitors [29]. Of the 6058 pairs we evaluated, there are 40 pairs for which experimental data (from CHEMBL or Ambit) are available for both kinases. Of these pairs, 11 had experimental results suggesting high overlap in inhibitor binding—they are “positive-pairs”. When we rank order the 40 pairs based on their scores, nine of the 11 positive-pairs rank first through eighth, and tenth (Figure 4). The binding site similarity scores of negative-pairs range from −1.9 to −3.6—quite separate from the positives. The significance of the separation of positive-pairs and negative-pairs in this ranking, as evaluated by a hyper-geometric distribution has a p-value of 4.5e-18.

Using a p-value cutoff <0.01 from the PDF (binding site similarity score cutoff of −3.76 in Figure 4), 50 pairs of kinases are predicted to have inhibiting-profile overlapping. Nine pairs have experimental evidence suggesting overlap in inhibitor binding- these are true positives (Table 1). One pair is considered false positive because although the two kinases are tested in CHEMBL, they do not share ligands using our relatively stringent cutoff (Table 2). The other 40 pairs with high apparent similarity are novel predictions (Table 3). Experimental evidence for these kinases pairs is not available in CHEMBL or Ambit.


We have presented a new algorithm for detecting ligand-binding site similarity, tested it on (1) the recognition of adenine-ribose binding ligands and (2) the recognition of FAD binding sites. We then applied it to the problem of predicting cross binding of ATP analogues inhibitors of kinases.

Summary of PocketFEATURE method

Our method works for two reasons. First, we employ a novel microenviroment-based representation and scoring system for comparing pockets that captures the physical and chemical properties in the binding pocket, and second, we do not impose rigid geometric matching criteria on the microenvironments within the pocket. The resulting method accurately recognizes similar microenvironments, and identifies combinations of microenvironments that can interact with fragments within ligand molecules.

Representation and similarity measure between individual microenvironments.

Microenvironments are represented using the FEATURE representation that captures physiochemical properties of a local subsite. The raw Tc score measures similarity between two microenvironments. However, it is difficult to compare Tc scores across pairs of microenvironments because the background similarities between different pairs are not the same. Therefore, it is necessary to normalize the Tc scores. We normalize the scores by creating a background distribution for each pair type (See Method). The normalized score S(Tc) is negative, decreases monotonically with increasing Tc, and changing most rapidly for Tc>Tc0. Therefore, our method seeks microenvironment-pairs that have high similarity, given the expected background score distribution.

We aligned microenvironments from an ATP, FAD and an NAD binding site (Figure 2). Microenvironment similarity score S(Tc) of aligned ones are outliers within the PDF (Figure 2C). The aligned microenvironments constitute functional modules for ligand recognition (Figure 2A). That is, particular microenviroments are associated with the recognition of particular molecular fragments with ligands. A better (more negative) S(Tc) similarity captures this shared recognition role in different binding sites.

Geometric flexible matching between microenvironments.

The combined interactions between multiple microenvironments in a target protein and molecular fragments within its ligand molecule drive molecular recognition. Because fragments can adopt different poses within a ligand molecule, the FEATURE based microenvironments in the target protein can adopt different relative geometries, as shown in Figure 3C. Some methods use geometric constraints to control local arrangements of functional modules. PocketFEATURE has relatively weak geometric requirements (only that the matching microenvironments be present within the pocket of interest). As a result, the number of possible alignments between pockets is increased, and PocketFEATURE can recognize site that bind similar ligands in different poses.

Our results with FAD-binding sites illustrate the value of geometric flexibility; FAD contains highly flexible regions between flavin and adenine. The FAD conformation and orientation varies widely across different protein families. By using PocketFEATURE, microenvironments corresponding to the same fragments within FAD are recognized even though these microenvironments adopt different local geometries according to their ligand poses (Figure 3C). These results demonstrate the value of allowing microenvironments to adopt variable orientations within pockets.

We have also assessed whether the microenvironment alignments identified by PocketFEATURE correspond to specific recognition of ligand chemical substructures. The average number of microenvironments between two ATP sites is six; while that of ADP is six, that of NAD is ten and that of FAD is twelve. The common aligned microenvironments of high frequencies for these four types of binding sites are: mD, mE, mR, mK, mS and mT, all of which are in close proximity to the common sub-fragments contained within ADP, ATP, FAD and NAD: the adenine, ribose and phosphate. Thus, these microenvironments have specific roles in recognizing particular fragments within the overall molecule. Furthermore, the additional aligned microenvironments observed in FAD and NAD sites are mW and mP, which “recognize” the phosphate, flavin or nicotinamide moieties.

In summary, PocketFEATURE is more sensitive than other state-of-the-art methods. It is suitable for application on a genome scale.

PocketFEATURE defines sites based on position of known ligand binding. For apo structures and uncharacterized sites, we can also use published patch-searching algorithms, such as CONCAVITY [30] and PocketPicker [2] to define sites (on-going projects).

Application of PocketFEATURE to drug discovery

Current computational studies on comparing kinase inhibitor binding sites often focus on known drug targets Tyrosine kinase (TK family, EC 2.7.10) and Serine/Threonine kinases ISTE family (EC 2.7.11). Potential similarity between divergent kinases has not been systematically explored either computationally or experimentally. For example, of the 6058 pairs of divergent kinases we compared in this work, only 40 pairs have experimental data from CHEMBL and Ambit. The literature is biased towards a few well-validated kinases. Some inhibitors may appear to be more promiscuous simply because they have been profiled more systematically.

As shown in Figure 4, PocketFEATURE identifies similar ligand binding sites across distantly related kinases. The surprisingly accurate predictions (Table 1) for those that have been tested make the remaining untested predictions (Table 3) with high similarity scores particular interesting.

Some of our predictions rediscover combinations of divergent kinases for multi-targeted drug design. One highly ranked prediction is the pair SRC (a Tyrosine kinase) and PIK3CG (a lipid kinase in PI3K family), which are evolutionarily distant (PID<10%). Both kinases have been observed to bind to a similar series of inhibitors (PP121 and its derivatives) [23]. The PocketFEATURE binding site similarity score (−4.26) ranks high —20th out of 6058 pairs. PocketFEATURE identifies six key microenvironments (Figure 5). One alignment matches mT338/SRC to mY867/PIK3CG, both of which reside near the typical “gate” of an ATP-binding site. The gatekeeper residue enables interactions between the deep hydrophobic pocket and ligands (adenine in ATP or pyrazolopyrimidine in inhibitor ABJ). In SRC structures, residue T338 is designated as the gatekeeper and in PIK3CG structures, residue I879 [23]. As PocketFEATURE does not seek alignments between polar and hydrophobic residues, it does not find a match between mT338/SRC and mI879/PIK3CG. Instead, in the PIK3CG structure, the proximal mY867/PIK3CG is aligned to mT338/SRC. Thus, PocketFEATURE detects similar binding sites from these two distant kinases. SRC activates the lipid kinases of the PI3K family, a central regulator of cell growth. Molecules that target both SRC and PIK3CG may have potent antitumor activity.

Figure 5. An example of validated positive predictions.

SRC (a Tyrosine kinase) and PIK3CG (a lipid kinase) bind to a same series of inhibitors (PP121 and its derivatives). Four related structures are available in PDB. The first row is structures of SRC bound with ANP (PDB ID: 2src) and a drug-like inhibitor (PDB ID is 3en7 and ligand PDB code is ABJ). The second row is structures of PIK3CG bound with ATP (PDB ID is1e8x) and ABJ (PDB ID is 2v4l). Between binding sites of 2src (SRC/ANP) and 1e8x (PIK3CG/ATP), PocketFEATURE aligned six pairs of microenvironments. At the position near the typical “gate” of an ATP-binding site, PocketFEATURE aligned mT338/SRC to mY867/PIK3CG (light yellow sphere). In SRC structures, the gatekeeper residue T338 enables interactions between the deep hydrophobic pocket and ligands. The original experimental study suggests the residue analogous to the gatekeeper in PIK3CG is I879 (red circle), which is the nearest microenvironment to mY867/PIK3CG. (A). 3D illustration of binding sites in SRC and PIK3CG (B). Aligned microenvironments.

PIK3CG is also predicted to share ligands with other kinases, including CSNK21A, MAP2K1, GCN2 and ACK2 (Table 13). Of these, CSNK21A encodes a Casein kinase that is involved in Wnt signaling pathway. Casein kinase inhibitors are considered potent anti-cancer drug candidates [31]. It is possible that drugs inhibiting both CSNK21A and PIK3CG may have synergistic anti-tumor activity. ACK2 is a homolog of CSNK21A (PID>90%). The binding site similarity score between PIK3CG and ACK2 ranks highly—18th out of 6058 pairs.

In addition to its similarity to CSNK21A, the pocket in PIK3CG is very similar to that in MAP2K1. These proteins do not share known ligands using our stringent interpretation of CHEMBL results (see Method). However, this apparent false positive deserves further scrutiny. MAP2K1 encodes an essential kinase in mitogen-activated protein (MAP) kinase signal transduction pathway. Activation of MAP kinase pathway plays important roles in the metastasis of pancreatic cancer. Specific inhibitors have been developed to inhibit oncogenic pathways. However, activation of PI3K pathway in response to MAP2K1 inhibition through a negative feed back loop limits the efficacy [32]. The high similarity between PIK3CG and MAP2K1 by PocketFEATURE suggest the possibility of inhibitors that target both kinases.

Other highly ranked novel predictions have some implications for side effects and drug repurposing. In Table 3, there are three pairs of mammalian: {Mvk-MAP2K1}, {Mvk-EPHA3} and {1GF1R-OXSR1}. Of these kinases, MAP2K1, OXSR1 and EPHA3 are implicated in cancer. Further evaluation of these pairs may be warranted. Most of other pairs in Table 3 are combinations of one kinase from human (or other mammalian) and one from plant (or bacteria). It is intriguing to consider that inhibitors for plant or bacterial kinases may be useful inhibitors of mammalian proteins.


FEATURE microenvironments

Given a functional center of a residue, we use the term “microenvironment” to refer to the local, spherical region in the protein structure that may encompass residues discontinuous in sequence and structure. Specifically, we use the FEATURE system to calculate a set of 80 physicochemical properties (Table S2) [14] collected over six concentric spherical shells (total 480 properties = 80 properties×6 shells) centered on the predefined functional center (Table S3) [33]. The total radius of the microenvironment is 7.5 Angstroms. A microenvironment is named using the following convention: “m” followed by “residue index” and “residue type”, upon which the microenvironment is centered. PDB identifier or gene name is tagged when necessary. For example m6L/1nhp represents the microenvironment centered on the functional center of the sixth residue in 1nhp, which is leucine (L). A complete description of FEATURE can be found in the original publication.

Similarity measure between two FEATURE microenvironments

Given a pair of FEATURE microenvironments (A and B) derived from two different sites, we calculate an adjusted Tanimoto coefficient based on the presence/absence of similar properties. We compute a single standard deviation (STD) for each of the 480 properties across a random set of FEATURE microenvironments (see section “Background calculation). Two microenvironments have a “similar property” if they differ by less than one STD for the given property. Given A and B, c is the number of “similar properties”; a and b are the numbers of non-zero properties in A and B, respectively; the denominator is the total number of unique properties that are non-zero in A or B or both (a+b−c); then the Tanimoto similarity is as follows: .

Background calculation

We make observations of the background distributions of Tc scores between microenvironments from different sites. We compile a dataset of 1160 sites from a non-redundant set of 3D structures in PDB using these filters: (1) structures were solved using X-ray diffraction at resolutions higher than 2.0 Angstrom; (2) no two structures have greater than 40% sequence identity; (3) specifically bound small molecule ligands have more than the heavy atoms. The binding residues are defined as those having any atom within 6 Angstroms of the ligand molecules, resulting in a total of 22008 microenvironments. There are 242 possible types of pairs between 22 types of microenvironments centered on 20 residues types (two centers for residue W and Y), but not all of these are likely to be matched. For computational efficiency, we group residues by physical properties in order to avoid comparisons that are unlikely to yield high similarity scores. The comparisons are limited to pairs of microenvironments that fall within the same groups: positively charged (R H K), negatively charged (D E), polar (S T Q N W1 Y1), non-polar (A C G I L M P V) and aromatic (W2 Y2 F). This produces 72 microenvironment-pairs (out of the 200 possible) that we check for high similarity scores. Given each microenvironment-pair, we derive Tc scores from the above dataset and fit these score into normal distribution.

Comparing two binding sites

Given the FEATURE microenvironments from two binding sites, we exhaustively calculate the raw Tc scores of all permissible microenvironment-pairs. We then normalize the Tc scores using the background frequency [34]:

Tc0 is the value at which S(Tc) is zero.

In practice, Tc0 is the Tc score at the mode on the fitted cumulative distribution function (CDF) of a given type of microenvironment-pair (See Background calculation). The normalized value, S(Tc), measures the similarity between two microenvironments and is thus the microenvironment similarity score.

We search for the mutual best-scoring microenvironment-pairs between two binding sites and assign alignment to such pairs using an cutoff of S(Tc) less than −0.3. For example, between site A (microenvironments A1, A2, A3, A4 and A5) and site B (microenvironments B1, B2, B3, B4 and B5), we align A1 to B1 only when (1) S(Tc) between A1 and B1 is smaller than those between A1 and B1, B2, B4 B5, also smaller than those between B1 and A2, A3, A4, A5; (2) S(Tc) between A1 and B1 is smaller than −0.3. The sum of all aligned microenvironment-pairs is the overall similarity score between two binding sites, and is termed the binding site similarity score. We can vary the cutoff for S(Tc) to change the precision and resolution of the comparison.

Benchmark datasets

We perform two sets of benchmark. The first benchmark identifies pairs of proteins that bind adenine-containing ligands, using two datasets [6] provided by Bourne group from USCD. Dataset I consists of 247 sites from non-redundant protein structures known to bind an adenine-containing ligand (ATP, ADP, NAD, FAD, SAH and SAM); Dataset II consists of 101 cavities from non-redundant protein structures believed not to bind an adenine-containing moiety. From Dataset I, we have 30381 pairs of sites that both bind adenine-containing ligands. Between Dataset I and II, we have 24947 control pairs of sites in which one site binds adenine-containing ligands and the other does not.

The second benchmark recognizes FAD binding sites from a large-scale dataset. A dataset consisting of 6958 druggable binding sites (Dataset-6958) was derived from scPDB entries [35] by filtering out PDB entries where atom coordinates of binding residues were incomplete. The binding sites were defined by including all the protein residues with at least one atom within 6 Angstrom of any ligand atom. Dataset-6958 includes a total of 249 FAD-binding proteins.

Predicting inhibitor-binding profiles of kinases

We derive a subset of 984 binding sites from Dataset-6958, including proteins classified in eight different EC families: EC 2.7.1, EC 2.7.2, EC 2.7.3, EC 2.7.4, EC 2.7.10, EC 2.7.11, EC 2.7.12 and EC 2.7.13. A total of 203 sites bind to ATP, ANP or ADP. The other 781 structures bind to 633 other ligands.

Using the 203 sites involved in binding ATP, ANP or ADP, we generate pairs of binding sites from functionally and structurally distant kinases by applying two rules: (1) two kinases are from different EC sub-subgroups; (2) the identity of structural alignment by MAMMOTH [36] between the two kinases is not higher than 30%. This results in 6058 pairs of sites that bind to ATP, ANP or ADP.

We perform validation using experimental data from CHEMBL [28] and Ambit panel [29]. Ambit panel contains binding assays between 37 inhibitors (21 tyrosine kinase inhibitors, 15 serine-threonine kinase inhibitors, and 1 lipid kinase inhibitor), which are classified according to the targets for which they were originally developed, and 317 human kinases (287 different human protein kinases, three lipid kinases and 27 disease-relevant mutant variants). Staurosporine is a non-selective kinase inhibitor and is therefore removed from the validation dataset.

We use two standards to identify candidate inhibitors from Ambit panel: a stringent cutoff of Kd<1 uM and a ratio of off-target to primary target affinities (Kd off-target/Kd primary target) <100. Our local database downloaded from CHEMBL includes binding assays between 3199 protein domains and 541,137 compounds. It is a collection of assays from a variety of experimental studies and therefore the standards applied to Ambit Panel are not applicable to CHEMBL data. We first filter out data with confidence level less than seven. We then use either (1) a cutoff of IC50 less than 1 uM or (2) a cutoff of Kd less than 1 uM or (3) inhibition level higher than 90% at 1 uM as an indication of inhibition.

From the 6058 pairs of divergent kinases, we first search for pairs for which experimental data, from CHEMBL or Ambit, are available for both kinases. Given such a pair, if at least one compound/inhibitor satisfies the standards above for both kinases, the pair shares ligand binding and is a “positive-pair”; otherwise this pair is a “negative-pair”.

Supporting Information

Figure S1.

Performance of PocketFEATURE for predicting FAD-binding sites using different queries.


Figure S2.

Performance of PocketFEATURE for predicting steroids binding sites.


Figure S3.

Hierarchical clustering of 40 sites by binding site similarity scores calculated by PocketFEATURE.


Figure S4.

Incorporating ligand chemoinformatics similarities in PocketFEATURE leads to prediction of higher confidence.


Table S1.

Highly ranked off-target predictions.


Table S3.

Functional centers of 22 types microenvironments.



We thank the Bourne lab for kindly providing benchmark dataset. We thank Shoichet lab for providing data using SEA. We thank D. Lupyan and A.R. Ortiz from Mount Sinai School of Medicine for providing the MAMMOTH program. We also thank E. Capriotti for helpful discussion and N.P. Tatonetti for organizing a local CHEMBL database.

Author Contributions

Conceived and designed the experiments: TL RBA. Performed the experiments: TL. Analyzed the data: TL RBA. Contributed reagents/materials/analysis tools: TL. Wrote the paper: TL RBA.


  1. 1. Brakoulias A, Jackson RM (2004) Towards a structural classification of phosphate binding sites in protein-nucleotide complexes: an automated all-against-all structural comparison using geometric matching. Proteins 56: 250–260.
  2. 2. Weisel M, Proschak E, Schneider G (2007) PocketPicker: analysis of ligand binding-sites with shape descriptors. Chem Cent J 1: 7.
  3. 3. Weill N, Rognan D (2010) Alignment-free ultra-high-throughput comparison of druggable protein-ligand binding sites. J Chem Inf Model 50: 123–135.
  4. 4. Schalon C, Surgand JS, Kellenberger E, Rognan D (2008) A simple and fuzzy method to align and compare druggable ligand-binding sites. Proteins 71: 1755–1778.
  5. 5. Kellenberger E SC, Rognan D (2008) How to measure the similarity between protein-ligand binding sites. Curr Comput-Aided 4: 209.
  6. 6. Xie L, Bourne PE (2008) Detecting evolutionary relationships across existing fold space, using sequence order-independent profile-profile alignments. Proc Natl Acad Sci U S A 105: 5441–5446.
  7. 7. Xie L, Li J, Xie L, Bourne PE (2009) Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors. PLoS Comput Biol 5: e1000387.
  8. 8. Shulman-Peleg A, Nussinov R, Wolfson HJ (2004) Recognition of functional sites in protein structures. J Mol Biol 339: 607–633.
  9. 9. Kuhn D, Weskamp N, Hullermeier E, Klebe G (2007) Functional classification of protein kinase binding sites using Cavbase. ChemMedChem 2: 1432–1447.
  10. 10. Kinoshita K, Furui J, Nakamura H (2002) Identification of protein functions from a molecular surface database, eF-site. J Struct Funct Genomics 2: 9–22.
  11. 11. Teague SJ (2003) Implications of protein flexibility for drug discovery. Nat Rev Drug Discov 2: 527–541.
  12. 12. Kahraman A, Morris RJ, Laskowski RA, Thornton JM (2007) Shape variation in protein binding pockets and their ligands. J Mol Biol 368: 283–301.
  13. 13. Bagley SC, Altman RB (1996) Conserved features in the active site of nonhomologous serine proteases. Fold Des 1: 371–379.
  14. 14. Wei L, Altman RB, Chang JT (1997) Using the radial distributions of physical features to compare amino acid environments and align amino acid sequences. Pac Symp Biocomput 465–476.
  15. 15. Banatao DR, Altman RB, Klein TE (2003) Microenvironment analysis and identification of magnesium binding sites in RNA. Nucleic Acids Res 31: 4450–4460.
  16. 16. Tang GW, Altman RB (2011) Remote thioredoxin recognition using evolutionary conservation and structural dynamics. Structure 19: 461–470.
  17. 17. Wu S, Liang MP, Altman RB (2008) The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation. Genome Biol 9: R8.
  18. 18. Liu T, Altman RB (2009) Prediction of calcium-binding sites by combining loop-modeling with machine learning. BMC Struct Biol 9: 72.
  19. 19. Benson JD, Chen YN, Cornell-Kennon SA, Dorsch M, Kim S, et al. (2006) Validating cancer drug targets. Nature 441: 451–456.
  20. 20. Sheridan RP, Nam K, Maiorov VN, McMasters DR, Cornell WD (2009) QSAR models for predicting the similarity in binding profiles for pairs of protein kinases and the variation of models between experimental data sets. J Chem Inf Model 49: 1974–1985.
  21. 21. Morphy R (2010) Selectively nonselective kinase inhibition: striking the right balance. J Med Chem 53: 1413–1437.
  22. 22. Ma XH, Shi Z, Tan C, Jiang Y, Go ML, et al. (2010) In-silico approaches to multi-target drug discovery: computer aided multi-target drug design, multi-target virtual screening. Pharm Res 27: 739–749.
  23. 23. Apsel B, Blair JA, Gonzalez B, Nazif TM, Feldman ME, et al. (2008) Targeted polypharmacology: discovery of dual inhibitors of tyrosine and phosphoinositide kinases. Nat Chem Biol 4: 691–699.
  24. 24. Dym O, Eisenberg D (2001) Sequence-structure analysis of FAD-containing proteins. Protein Sci 10: 1712–1728.
  25. 25. Morris RJ, Najmanovich RJ, Kahraman A, Thornton JM (2005) Real spherical harmonic expansion coefficients as 3D shape descriptors for protein binding pocket and ligand comparisons. Bioinformatics 21: 2347–2355.
  26. 26. Najmanovich R, Kurbatova N, Thornton J (2008) Detection of 3D atomic similarities and their use in the discrimination of small molecule protein-binding sites. Bioinformatics 24: i105–111.
  27. 27. Hert J, Keiser MJ, Irwin JJ, Oprea TI, Shoichet BK (2008) Quantifying the relationships among drug classes. J Chem Inf Model 48: 755–765.
  28. 28. Overington J (2009) ChEMBL. An interview with John Overington, team leader, chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). Interview by Wendy A. Warr. J Comput Aided Mol Des 23: 195–198.
  29. 29. Karaman MW, Herrgard S, Treiber DK, Gallant P, Atteridge CE, et al. (2008) A quantitative analysis of kinase inhibitor selectivity. Nat Biotechnol 26: 127–132.
  30. 30. Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA (2009) Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol 5: e1000585.
  31. 31. Yang WS, Stockwell BR (2008) Inhibition of casein kinase 1-epsilon induces cancer-cell-selective, PERIOD2-dependent growth arrest. Genome Biol 9: R92.
  32. 32. Mirzoeva OK, Das D, Heiser LM, Bhattacharya S, Siwak D, et al. (2009) Basal subtype and MAPK/ERK kinase (MEK)-phosphoinositide 3-kinase feedback signaling determine susceptibility of breast cancer cells to MEK inhibition. Cancer Res 69: 565–572.
  33. 33. Yoon S, Ebert JC, Chung EY, De Micheli G, Altman RB (2007) Clustering protein environments for function prediction: finding PROSITE motifs in 3D. BMC Bioinformatics 8: Suppl 4S10.
  34. 34. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3: 141–148.
  35. 35. Kellenberger E, Muller P, Schalon C, Bret G, Foata N, et al. (2006) sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank. J Chem Inf Model 46: 717–727.
  36. 36. Ortiz AR, Strauss CE, Olmea O (2002) MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Sci 11: 2606–2621.
  37. 37. Lill MA, Danielson ML (2011) Computer-aided drug design platform using PyMOL. J Comput Aided Mol Des 25: 13–19.