Abstract
With a burgeoning number of artificial intelligence (AI) applications in various fields, biomolecular science has also embraced advanced AI techniques in recent years. In this broad field, scoring a protein-ligand binding structure to output the binding strength is a crucial problem that is closely related to computational drug discovery. Aiming at this problem, we have proposed an efficient scoring framework using deep learning techniques. This framework describes a binding structure as a high-resolution atomic graph, places a focus on the inter-molecular interactions, and learns the graph in a rational way. For a protein-ligand binding complex, the generated atomic graph preserves key information about the atoms (as graph nodes) and focuses on the inter-molecular interactions (as graph edges), which can be identified by applying multiple distance ranges to the atom pairs within the binding area. To provide more confidence in the predicted binding strengths, we have interpreted the deep learning model at the model level and through a post-hoc analysis. The proposed learning framework has demonstrated competitive performance in scoring and screening tasks, which will prospectively further promote the development of related fields.
Author summary
The binding between a small compound (ligand) and a protein plays a crucial role in many biological processes, such as signal transduction and immunoreaction. Particularly, a small-molecule drug can bind to a target protein to modulate its signaling pathways and suppress the progression of the associated disease. Apparently, the binding strength is a key indicator for evaluating how well such small-molecule drugs work, therefore becoming a core topic in computational drug discovery. Nowadays, the binding structure of a ligand and its target protein can be resolved experimentally or modeled computationally, while the accurate scoring of such a binding structure (predicting the binding strength) still remains a challenge. An effort has been put into the development of benchmark databases that provide a variety of protein-ligand binding structures and their experimentally resolved binding strengths, leading to increasing deep learning applications in this field. In this study, we represent a protein-ligand binding structure as a graph, with the atoms as nodes and the inter-molecular interactions as edges. A light but efficient deep learning architecture has been adopted for learning such graphs and outputting the binding strengths. Validated by our experiments, the model performs well in both scoring and screening tasks.
Citation: Wang DD, Huang Y (2025) Scoring protein-ligand binding structures through learning atomic graphs with inter-molecular adjacency. PLoS Comput Biol 21(5): e1013074. https://doi.org/10.1371/journal.pcbi.1013074
Editor: Mohammad Sadegh Taghizadeh, Shiraz University, IRAN, ISLAMIC REPUBLIC OF
Received: April 7, 2024; Accepted: April 21, 2025; Published: May 9, 2025
Copyright: © 2025 Wang, Huang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All the original data for model construction are available from the PDBbind database (https://www.pdbbind-plus.org.cn/). Specifically, the Version 2020 (PDBbind v2020) was used in this work, and it can be accessed from the Download section of PDBbind+ website. The CASR database was used for evaluating the scoring performance of constructed models (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3753885/). These data have been cleaned, standardized, and stored in Zenodo (https://zenodo.org/records/15023336). The screening power of the constructed models were measured using the data from DUD-E (https://dude.docking.org/). All code files are available from an online GitHub repository at https://github.com/debbydanwang/DL-PLBAP. A Docker container with a trained model pre-installed is available for access on Zenodo (https://zenodo.org/records/15023336).
Funding: This work was supported by Hong Kong Research Grants Council (Project UGC/FDS16/E16/23 to DDW) and Hong Kong Metropolitan University (Project 2023/24 S&T to DDW). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors received salaries from Hong Kong Metropolitan University.
Competing interests: The authors have declared that no competing interests exist.
Introduction
AI for science has attracted considerable attention in the past decade. Quite a number of powerful mathematical algorithms have been developed in this field to meet challenging tasks in dermatology [1], precision medicine [2], molecular science [3] and drug discovery [4].
As a crucial problem in computer-aided drug discovery (CADD), scoring a protein-ligand complex structure to exhibit its binding strength (Fig 1) has long sought breakthroughs from AI developments. Such binding strength, as a key indicator of the efficacy of a drug that attaches to its target protein, can mostly be attributed to various non-covalent interactions (e.g. hydrogen bonds, hydrophobic contacts and π-stacking). Earlier AI-based scoring works leveraged traditional machine-learning algorithms (e.g. random forests) to parse a feature vector, which describes the interactions in a protein-ligand complex structure, and mapped the vector to the binding strength [5–10]. This lasted until the emergence of deep learning, which reached its scientific milestones with the launch of AlphaFold (near-perfect protein-fold predictions) [11] and the GPT series (strong human-like chatbots) [12].
Scoring a protein-ligand complex structure to exhibit the binding strength.
When first introduced to the task of scoring molecular binding strength, deep learning was primarily utilized in the form of convolutional neural networks (CNNs) [13–18]. Accordingly, molecular lattices or grids, with each cell characterized by a collection of atomic properties (e.g. physico-chemical or pharmacophoric), became the de facto feature representations of a protein-ligand complex. The KDEEP model adopts a fixed-size molecular lattice with a set of eight atomic properties (pharmacophoric) to delineate a complex structure, and feeds the lattice into a 3D-CNN for binding-strength prediction [13]. The Pafnucy model compresses 19 atomic properties (both physico-chemical and pharmacophoric) of a complex structure into a molecular lattice, and employs a simple 3D-CNN architecture for learning the lattice [14]. Rezaei et al. developed a light-weight 3D-CNN model for scoring, based on molecular lattices that concern 24 atomic features (11 Arpeggio atom types and the excluded volume for both protein and ligand) [15]. Although it has opened a new avenue for scoring works, deep lattice learning often lacks rotational invariance in the data and is therefore resource-intensive after data augmentation [13, 14].
More recently, molecular graph learning has become a prevalent technique for scoring works. In this context, a protein-ligand complex structure is commonly represented as a 2D atomic graph, which is then decoded by graph neural networks (GNNs) [19–23]. GraphBAR adopts a molecular graph with distance-dependent edges to characterize the binding-site atoms in a complex structure, and employs a spectral graph convolutional network (GCN) to map the graphs to the binding strengths [20]. Shen et al. considered the covalent connections for atoms in the binding area, and leveraged a cascade GCN (with two concatenated modules) for graph learning and binding-strength prediction [21]. GraphscoreDTA represents a complex by a fusion of graphs (a 1D amino-acid graph for the protein, an atomic graph for the ligand, and a hybrid graph for the binding pocket), and predicts the binding strength using a GNN with a bitransport information mechanism and Vina distance terms [23]. Zhang et al. utilized a similar graph representation and developed a multi-objective GNN model for binding-strength scoring [22]. These pioneering works have shed light on modern scoring approaches. Nevertheless, there is still much room for improvement in developing target-oriented graph representations, achieving high screening power and making the models more transparent. Accordingly, in this work we are dedicated to the design of efficient scoring models with informative molecular graphs, decent screening power and reasonable interpretability.
Materials and methods
Atomic-level molecular graphs
A molecular graph can be represented as G = (V, E), where V = {nd_1, nd_2, ..., nd_N} indicates the nodes and E stands for the edges connecting those nodes.
To capture sufficient information in a molecule, treating its atoms as graph nodes is a well-acknowledged strategy. Each node or atom is then characterized by a series of physico-chemical or pharmacophoric properties, leading to a feature matrix F ∈ ℝ^(N×m) for all the nodes in the graph (m is the number of properties). As molecules like proteins are very large in terms of atoms, retaining all the atoms places a heavy burden on the computations, and task-oriented cropping is therefore frequently performed. For scoring a protein-ligand binding structure, the atoms in the binding area are often of interest. This results in a smaller feature matrix F ∈ ℝ^(n×m), where n is the number of nodes in the binding area (n ≪ N).
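As an illustration, the binding-area cropping described above might be sketched with numpy as follows; the cutoff value and the array layouts here are assumptions for the example, not the exact settings of this work:

```python
import numpy as np

def crop_binding_area(protein_xyz, ligand_xyz, protein_feats, ligand_feats, cutoff=4.0):
    """Keep all ligand atoms plus the protein atoms within `cutoff` of any ligand atom.

    Returns the cropped coordinate (n x 3) and feature (n x m) matrices.
    The cutoff value here is an illustrative assumption.
    """
    # Pairwise distances between protein and ligand atoms (n_p x n_l).
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    keep = (dist <= cutoff).any(axis=1)            # protein atoms near the ligand
    xyz = np.vstack([protein_xyz[keep], ligand_xyz])
    feats = np.vstack([protein_feats[keep], ligand_feats])
    return xyz, feats
```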
Generally, a graph in deep learning works shows the connections between nodes by an adjacency matrix A, where A_ij indicates an edge between the i-th and j-th nodes. However, designing task-specific graph edges, especially for tasks involving molecules, is often challenging. Covalent bonds, contacts defined through distance thresholding, or a combination of the two have been regarded as edges in different works [20, 21, 24]. Considering the atoms in the binding area of a protein-ligand complex, Fig 2A shows the covalent adjacency among those atoms. Interactions or contacts between a pair of atoms (nd_i and nd_j) can also be defined by the range in which the atomic distance (d_ij) resides, leading to multi-level distance-dependent adjacencies among atoms. Fig 2B displays two types of such distance-dependent atomic contacts. In Fig 2C, a hybrid type of adjacency (covalent bonds plus distance-dependent contacts) is considered. For scoring tasks, these adjacency definitions either emphasize the covalent bonds or mix the inter- and intra-molecular interactions, resulting in a loss of focus on the inter-molecular interactions. Nevertheless, these inter-molecular interactions play a pivotal role in determining the binding strength between a ligand and its target protein. Accordingly, we focus on the inter-molecular contacts in this work, and define multi-level atomic adjacencies by one-hot encoding of the contacts belonging to different distance ranges (Fig 2D). Such adjacencies can be stored in an adjacency tensor A ∈ {0, 1}^(K×n×n), where each slice A_k marks all the pairs of nodes having distances in the k-th range R_k, as follows.

A_k[i, j] = 1 if d_ij ∈ R_k and (nd_i, nd_j) is a protein-ligand atom pair; A_k[i, j] = 0 otherwise. (Eq 1)
A. Covalent adjacency. B. Distance-dependent contacts. C. A combination of covalent adjacency and distance-dependent contacts. D. Inter-molecular contacts through distance thresholding.
Algorithm 1 shows the procedure for generating such an adjacency tensor for a protein-ligand binding area.
Algorithm 1 Generating an Inter-molecular Adjacency Tensor
Input: Coordinates of the atoms in the binding area (C ∈ ℝ^(n×3)), a list of distance ranges {R_1, ..., R_K}
Output: An inter-molecular adjacency tensor A
Initialize A ← 0 (A ∈ {0, 1}^(K×n×n)).
Calculate the distance matrix D based on C.
for k = 1 to K do
    A_k ← 1(D ∈ R_k)    ▷ 1(·) is an indicator function
    for i = 1 to n do
        for j = i + 1 to n do
            if A_k[i, j] = 1 and nd_i-nd_j is not a protein-ligand atom pair then
                A_k[i, j] ← 0, A_k[j, i] ← 0    ▷ Turn off intra-molecular interactions
            end if
        end for
    end for
end for
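Algorithm 1 can be sketched in numpy as below; the boolean `is_ligand` mask is an assumed input encoding of which binding-area atoms belong to the ligand:

```python
import numpy as np

def intermolecular_adjacency(coords, is_ligand, ranges):
    """Build a K x n x n inter-molecular adjacency tensor (Algorithm 1 sketch).

    coords    : n x 3 array of binding-area atom coordinates
    is_ligand : boolean array of length n (True for ligand atoms)
    ranges    : list of (low, high] distance ranges, e.g. [(2.0, 3.0), (3.0, 4.0)]
    """
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))              # pairwise distance matrix D
    # A pair is inter-molecular iff exactly one of its atoms belongs to the ligand.
    inter = is_ligand[:, None] != is_ligand[None, :]
    A = np.zeros((len(ranges), n, n))
    for k, (lo, hi) in enumerate(ranges):
        in_range = (dist > lo) & (dist <= hi)        # indicator of range membership
        A[k] = (in_range & inter).astype(float)      # drop intra-molecular pairs
    return A
```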
Graph-based deep learning
Given a graph with a node-feature matrix F and an adjacency matrix A, message-passing mechanisms are frequently adopted for learning such a graph [25]. These mechanisms, as shown in Eq 2, update the node features iteratively in a local context.

h_i^(l+1) = φ(h_i^(l), ⊕_{j∈N(i)} m_ij^(l)) (Eq 2)

Here, h_i^(l) is the feature vector describing the i-th node in the l-th layer, N(i) indicates the neighborhood of the i-th node (based on the adjacency matrix), m_ij^(l) is the message passed from the j-th node to the i-th node in the l-th layer, ⊕ denotes a permutation-invariant function (e.g. average), and φ is an update function such as a neural network.
Although a wide variety of graph neural networks (GNNs) have been developed, properly learning molecular graphs still remains a challenge. The ChebNet [26], leveraging spectral graph convolutions, is among the well-acknowledged GNNs. It has an efficient form for updating the node features in each iteration, as follows.

F^(l+1) = σ(Â F^(l) W^(l)) (Eq 3)

Here, F^(l) is the feature matrix for all the nodes in the l-th layer, σ is an activation function, W^(l) is the weight matrix, and Â is a normalized adjacency matrix with self-adjacencies (Â = D̃^(-1/2) (A + I) D̃^(-1/2), where D̃ is the diagonal degree matrix of A + I). Such graph-learning operations can be stacked into L layers. From the message-passing perspective, this mechanism can be regarded as a simple average of the normalized information collected from the neighborhood of a node.
ChebNet has previously been questioned for its capability of capturing long-range dependencies among the nodes in a graph. However, scoring protein-ligand binding strength is a task that largely concerns local contexts (e.g. a key hydrogen bond or an important interaction), making the ChebNet mechanism fit well here. When focusing on only the inter-molecular interactions (Fig 2D), we update the features once (Eq 3) to learn the neighborhoods of the binding-site atoms in this work. Higher-order graph convolutions, which would involve intra-molecular interactions and be computationally expensive, are not considered. This strategy places an absolute focus on the inter-molecular interactions (crucial to scoring works) and is highly efficient. Since L = 1 in this scenario, the layer notation l will be omitted for simplicity in what follows.
Instead of using a single adjacency matrix A, an adjacency tensor that covers different adjacency (interaction) types is necessary in a scoring task. As inter-molecular interactions are mostly non-covalent (with atomic distances beyond the covalent-bond range), multiple distance ranges starting from the covalent threshold can be nominated to construct the inter-molecular adjacency tensor. Fig 2D exhibits such a two-slice adjacency tensor A (Eq 5), whose slices correspond to the two nominated distance ranges. We adopt this tensor because its outer cutoff has been verified to be a distance threshold for capturing sufficient inter-molecular interactions in a binding complex [20].
Targeting each type of inter-molecular contacts (slice A_k), the graph nodes can be learned using the message-passing mechanism in ChebNet, as

F_k = σ(Â_k F W_k),

where Â_k = D̃_k^(-1/2) (A_k + I) D̃_k^(-1/2), D̃_k is a diagonal matrix showing the degree of each node in A_k + I, and all the other notations follow Eq 3.

After collecting the messages from the direct neighbors of the graph nodes, we gather the information at the graph level for the scoring purpose. Such an aggregation function is permutation-invariant and similar to that in Eq 2. A simple summation in the following equation serves as an example.

g_k = Σ_{i=1}^{n} F_k[i, :]

The features for the different inter-molecular interaction types (g_k) are then concatenated before being fed into dense layers for the final graph-level predictions.

h = CONCAT(g_1, ..., g_K)

Here, CONCAT indicates a concatenation of features and h stands for the hidden features describing the whole binding-site graph.
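The per-slice convolution, summation pooling, concatenation and final dense mapping described above can be sketched in a small numpy forward pass; the dimensions, random weights and ReLU activation are illustrative assumptions rather than the trained model's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def agima_forward(F, A_tensor, Ws, W_dense):
    """Sketch of the graph-level forward pass:
    per-slice graph convolution -> node summation -> concatenation -> dense layer."""
    pooled = []
    for A_k, W_k in zip(A_tensor, Ws):
        A_tilde = A_k + np.eye(len(A_k))              # add self-adjacencies
        d = A_tilde.sum(axis=1)
        A_hat = A_tilde / np.sqrt(np.outer(d, d))     # D~^(-1/2) (A_k + I) D~^(-1/2)
        F_k = np.maximum(0.0, A_hat @ F @ W_k)        # per-slice convolution
        pooled.append(F_k.sum(axis=0))                # permutation-invariant summation
    h = np.concatenate(pooled)                        # hidden graph-level features
    return float(h @ W_dense)                         # scalar binding strength

# Toy example: 6 binding-area atoms, 4 node features, 2 adjacency slices.
n, m, emb = 6, 4, 8
F = rng.normal(size=(n, m))
A_tensor = (rng.random((2, n, n)) > 0.7).astype(float)
A_tensor = np.triu(A_tensor, 1)
A_tensor += A_tensor.transpose(0, 2, 1)               # symmetrize each slice
Ws = rng.normal(size=(2, m, emb))
W_dense = rng.normal(size=2 * emb)
score = agima_forward(F, A_tensor, Ws, W_dense)
```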
Referring to a well-established architecture (GraphBAR [20]), we developed a light graph-learning architecture that focuses only on inter-molecular interactions and learns the interactions through direct atomic neighborhoods (Fig 3).
The node feature matrix F and inter-molecular adjacency tensor A of a protein-ligand complex are the inputs, and the binding strength is the output. Main components of this architecture include graph convolution layers, node aggregation layers, dense (fully-connected) layers and dropout layers.
Experiments and results
Scoring performance of models
The aforementioned framework scores the binding strengths of protein-ligand complex structures through learning Atomic Graphs with Inter-Molecular Adjacency (AGIMA-based scoring, abbreviated as AGIMA-Score). In an AGIMA-Score model, the binding area of a complex structure is treated as a graph, represented by a node-feature matrix F and an adjacency tensor A. Here the binding area is recognized as all the ligand atoms plus the protein atoms within a cutoff distance of any ligand atom, referring to Son's work [20]. Three sets of node features, referring to Pafnucy (m = 18) [14], KDEEP (m = 8) [13] and GraphBAR (m = 13) [20] respectively, were adopted to construct F (Table 1). The 18-feature set includes generic physico-chemical properties (e.g. atom types and partial charge) and pharmacophoric properties (e.g. aromaticity and hydrogen-bond membership) of the atoms. The 13-feature set is a subset of the 18-feature set that excludes the pharmacophoric properties. The 8-feature set focuses on pharmacophoric properties, with atomic charge and excluded volume also considered.
Three feature sets, with 18 features (from Pafnucy), 8 features (from KDEEP) and 13 features (from GraphBAR) respectively, were considered in this study. The names and data types of these features are listed.
When constructing the inter-molecular adjacency tensor A (Fig 2D), two distance ranges were selected to capture multi-level protein-ligand interactions, as follows. As a pair of atoms separated by a very short distance is mostly connected by a covalent bond, we paid more attention to atom pairs farther apart when characterizing the (non-covalent) inter-molecular interactions. Meanwhile, since the binding area is recognized according to a pairwise atomic distance cutoff, we employed the two distance ranges in Eq 5, spanning from the covalent threshold to this cutoff, to build the adjacency tensor (K = 2) in this work. Combining the three node-feature matrices (with 18, 8 and 13 features) and the adjacency tensor A in the generation of molecular graphs, we constructed three AGIMA-Score models (AGIMA-Score18, AGIMA-Score8 and AGIMA-Score13) based on the graph-learning architecture in Fig 3. To investigate whether a single distance range can cover sufficient adjacency information, we built a single-matrix adjacency tensor (K = 1) to pair up with the three node-feature matrices for each protein-ligand complex. This led to the construction of three new models (AGIMA-Score18SAM, AGIMA-Score8SAM and AGIMA-Score13SAM) for comparison purposes. In addition, the non-redundant features from the three sets were collected and combined with the adjacency tensor A to build the AGIMA-Score21 model (m = 21). The dense layers each have a dimension of 128, and the number of epochs and batch size were tuned when constructing these models.
In order to evaluate the performance of these models comprehensively, several broadly-discussed deep-learning scoring models were implemented as competitors. These include the Atomic Convolutional Neural Network (ACNN) [27], OnionNet [18], KDEEP [13] and GraphBAR [20]. For ACNN, parameters including the pooling filters, number of epochs and batch size were tuned to reach the best model. The number of epochs and batch size were tuned for OnionNet and KDEEP (no data augmentation). As two similar graph-learning approaches, GraphBAR considers both intra- and inter-molecular contacts (Fig 2B) while AGIMA-Score focuses on only the inter-molecular contacts (Fig 2D) in the construction of molecular graphs. To make a fair comparison with the AGIMA-Score models, we constructed two GraphBAR models, GraphBAR2AM and GraphBAR3AM, based on the architecture in Fig 3. GraphBAR2AM takes into account the intra-/inter-molecular contacts within the covalent range and those within the full non-covalent range when building the adjacency tensor, which corresponds to the AGIMA-ScoreSAM models (considering inter-molecular contacts in the single non-covalent range). GraphBAR3AM adopts three distance ranges (the covalent range plus the two non-covalent ranges) to collect the intra-/inter-molecular adjacencies, corresponding to the AGIMA-Score models (considering inter-molecular contacts in the two non-covalent ranges). The number of epochs and batch size were treated as tuning parameters for the two GraphBAR models.
The AGIMA-Score and competing models were constructed based on the benchmark PDBbind database (https://www.pdbbind-plus.org.cn/). The Refined Set and Core Set in this database were employed for training and parameter tuning (validation). Each sample in these two sets is a protein-ligand complex structure (determined mostly by X-ray crystallography or NMR spectroscopy) with the experimentally resolved binding strength (−logKd/i). These structural and binding-strength data are of high quality, as they have gone through rigorous filtering processes [28, 29]. To avoid potential train-validation contamination, each pair of complexes, one from the validation set and the other from the training set, needs to pass a similarity test. This test guarantees that, in each pair of complexes, the similarity of the two protein sequences is below 0.3 or the similarity of the two ligands is below 0.7. Protein sequence similarities were generated using the crossSetSim function from the protr R library, with the default BLOSUM62 substitution matrix. Ligand similarities were calculated using the cmp.similarity function from the ChemmineR library, based on SMILES-transformed descriptors. The complexes violating this rule were removed from the Validation set. Two sets from CSAR [30] were regarded as the final test sets, named Test1 and Test2, to avoid the over-optimistic results yielded by using data sets from the same source. The aforementioned similarity test was also performed on each test set against the training set to prevent potential train-test contamination. After this cleaning, the similarity statistics for pairwise complexes, with one complex from the training set and the other from the Validation, Test1 or Test2 set, are presented in Fig 4. Furthermore, the same protocol was adopted to ensure that there was no contamination among the Validation, Test1 and Test2 sets.
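The filtering rule above can be sketched as follows; `prot_sim` and `lig_sim` are placeholder callables standing in for the protr and ChemmineR similarity computations used in this work:

```python
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Complex:
    protein: str   # protein sequence (or identifier)
    ligand: str    # ligand SMILES (or identifier)

def passes_similarity_test(cand: Complex,
                           training_set: Iterable[Complex],
                           prot_sim: Callable[[str, str], float],
                           lig_sim: Callable[[str, str], float]) -> bool:
    """Keep `cand` only if, against every training complex, the protein-sequence
    similarity is below 0.3 or the ligand similarity is below 0.7."""
    for ref in training_set:
        if prot_sim(cand.protein, ref.protein) >= 0.3 and \
           lig_sim(cand.ligand, ref.ligand) >= 0.7:
            return False   # too similar to a training complex: contamination risk
    return True
```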
Finally, the filtered Training, Validation, Test1 and Test2 sets consist of 5007, 195, 116 and 102 complex structures respectively. The lists of complexes for these sets can be found in Zenodo (https://zenodo.org/records/15023336).
The horizontal axis stands for the similarity between the two protein sequences involved in a complex pair, and the vertical axis indicates the similarity between the two involved ligands. The red dotted line means a sequence similarity of 0.3 and the yellow line shows a ligand similarity of 0.7.
The performance of each model was evaluated according to (1) the Pearson's correlation (PC) between the experimental and predicted binding strengths of the complex structures and (2) the root-mean-square error (RMSE) concerning those binding strengths. The evaluation results are listed in Table 2.
The models were trained on the PDBbind Refined Set (v2020) with parameters tuned via the Core Set (v2020), and tested on two sets from the CSAR source. State-of-the-art deep learning models (ACNN, OnionNet, KDEEP and GraphBAR) for scoring protein-ligand complexes were implemented to comprehensively evaluate the proposed AGIMA-Score models. For GraphBAR, different graph-adjacency schemes (2 or 3 adjacency matrices) were adopted for model construction. For AGIMA-Score, different node features (separately referring to Pafnucy, KDEEP and GraphBAR) and adjacency schemes (2 adjacency matrices or a single adjacency matrix) were considered for model investigation. By default, 2 adjacency matrices (generated from the inter-molecular atomic contacts in the two nominated distance ranges) were adopted in the graph learning by AGIMA-Score. The best performance in terms of PC and RMSE is underlined for the state-of-the-art methods and the proposed AGIMA-Score models.
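For reference, the two evaluation metrics (PC and RMSE) can be sketched in numpy as follows:

```python
import numpy as np

def pearson_corr(y_true, y_pred):
    """Pearson's correlation between experimental and predicted binding strengths."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])

def rmse(y_true, y_pred):
    """Root-mean-square error of the predicted binding strengths."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```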
An ACNN model is easily underfitted while a KDEEP model is often overfitted. Among the earlier models (ACNN, OnionNet, KDEEP and GraphBAR), GraphBAR outperforms the others in terms of Test1-PC, Test1-RMSE and Test2-RMSE, while OnionNet reaches the best PC on the Test2 set. Compared to these earlier models, the AGIMA-Score models perform well on average. AGIMA-Score18 achieves the overall best performance, with the lowest Test1-RMSE, highest Test2-PC and lowest Test2-RMSE. AGIMA-Score8 attains the best performance with respect to Test1-PC. Although using a single adjacency matrix (AGIMA-ScoreSAM) eases the computations in graph learning, it often results in underperformance in terms of PC and RMSE (compared to AGIMA-Score). This shows that using two adjacency matrices captures more information about the connections among atoms, such as the strong and weak hydrogen bonds with different donor-acceptor distance ranges [31]. The outperformance of AGIMA-Score over GraphBAR demonstrates the efficacy of learning the molecular graphs with inter-molecular adjacencies, rather than mixed intra- and inter-molecular adjacencies, in a scoring task. Overall, these results reveal the strong competitiveness of the AGIMA-Score models in scoring the binding strength of a protein-ligand complex. A Docker container with the trained AGIMA-Score18 model pre-installed is available on Zenodo (https://zenodo.org/records/15023336).
Screening performance of models
The scoring performance shows the capability of ranking a list of binding complexes or predicting accurate binding strengths. Beyond that, the screening power is another indicator of interest for further evaluating the prediction models. As a more practical task, virtual screening aims to discover the potential binders for a target protein, in order to mitigate the burden of downstream biochemical experiments (Fig 5). Such a target protein often plays a key role in regulating the progression of some disease, exemplified by the epidermal growth factor receptor (EGFR) protein that mediates the growth of non-small-cell lung cancer (NSCLC). Modeling the binding structure of each protein-ligand pair (docking) and scoring the binding strength based on this structure (scoring) are the two primary subtasks in virtual screening. Current state-of-the-art docking tools (e.g. AUTODOCK [32] and Glide [33]) can provide near-experimental binding structures for a pair of protein and ligand, while accurately scoring such a binding structure has long been a challenge. This work aims mainly at the scoring phase. Once we have the binding structures scored according to a model, whether the true binders for the target protein can be highly ranked is a main indicator of the screening capability of this model. A model with both high scoring and high screening power is always the pursuit of the CADD community.
The task starts from a target protein and a big library of ligands, followed by the modeling of each protein-ligand binding structure (docking tool) and the scoring of the binding structures (scoring model). The highly-ranked ligands (according to the predicted scores) will be regarded as potential binders for further biochemical experiments.
To evaluate the screening power of the AGIMA-Score models and their competitors, we selected a comparably large set from the DUD-E source (https://dude.docking.org). This set concerns the aforementioned EGFR and its potential ligands. A total of 36,273 ligands have been included in this set, with 832 actives (binders) and 35,441 decoys (non-binders) for the EGFR protein. Notably, the models discussed above accept protein-ligand binding structures as inputs, but this EGFR set only contains the structures of the monomers (the EGFR protein and the ligands). Accordingly, we paired the protein with each ligand into a binding structure using the well-acknowledged AUTODOCK Vina docking tool, before feeding the structures into the models. The best binding pose was retained for each protein-ligand pair, based on the default setting in AUTODOCK Vina and a reference structure (PDB:2RGP). The generated 36,273 EGFR-ligand binding structures were then fed into each model (ACNN, OnionNet, KDEEP, GraphBAR2AM, GraphBAR3AM, AGIMA-Score18, AGIMA-Score8 and AGIMA-Score13) to predict the binding strengths.
The enrichment factor (EF) is a widely-used index for evaluating the screening performance of a scoring model. It is defined as EF_X = HR_X / HR_total, where HR_X is the percentage of actives in the top X% ranked ligands and HR_total is the percentage of actives in the whole set. Meanwhile, the total decoy-to-active ratio (rDTA) for this set is approximately 42.6 (35,441/832), indicating a high imbalance between actives and decoys. To provide a more comprehensive evaluation, we composed a series of secondary sets according to varying rDTA values and assessed the corresponding EFs for each model on these sets. For a specific model MDL, an rDTA value r and an X value, this procedure is described as follows.
- Keep all the 832 actives and randomly select 832 × r decoys to constitute a set of size 832 × (1 + r).
- According to MDL, score and rank all the EGFR-ligand complexes in the set generated above, and calculate EF_X.
- Repeat the process 10 times to derive the average EF.
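The EF computation in the procedure above can be sketched as follows (assuming that higher predicted scores indicate stronger binding):

```python
import numpy as np

def enrichment_factor(scores, is_active, top_frac=0.01):
    """EF_X: the fraction of actives among the top X% ranked ligands divided by
    the fraction of actives in the whole library (higher scores rank first)."""
    order = np.argsort(scores)[::-1]                     # descending by predicted score
    n_top = max(1, int(round(top_frac * len(scores))))   # size of the top X% slice
    top_hit_rate = np.asarray(is_active)[order[:n_top]].mean()
    overall_rate = np.mean(is_active)
    return float(top_hit_rate / overall_rate)
```

A random ranking yields an EF of about 1, which is why the EF = 1 line is marked in the screening-performance figures.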
The top-ranked ligands (X = 1% and 2%) were used for evaluating the screening performance of each model in Table 2, and the results are exhibited in Fig 6.
In each scenario, the enrichment factors (EFs) regarding various decoy-to-active ratios (rDTA) were calculated for each model, and plotted in a line. The black dashed lines indicate EF = 1.
Here, the EF worsens as rDTA or X grows, for all the models. However, AGIMA-Score18 performs markedly better than the others. Encouragingly, AGIMA-Score18 even achieves an EF of 26 for the top 1% ranked ligands when all the decoys are included in the assessment. Similar results for other fractions of top-ranked ligands are displayed in S1 Fig. Three more tasks, involving the target proteins HIV protease (HIVPR), ADAM17 protease (ADA17) and tyrosine-protein kinase SRC (SRC), were considered. The HIVPR set covers 37,673 potential ligands (1,395 actives/36,278 decoys) for the HIVPR protein. 30,956 (959 actives/29,997 decoys) and 35,790 (831 actives/34,959 decoys) ligands are included in the ADA17 and SRC sets, respectively. The screening performance of the AGIMA-Score models and their competitors on these three sets, in terms of the top 1% and 2% ranked ligands, is displayed in Fig 7. For the HIVPR set, AGIMA-Score13 performs the best, with an EF of 29 for the top 1% ligands when involving all decoys. AGIMA-Score8 outperforms the others for the ADA17 set and AGIMA-Score18 is the best performer for the SRC set, with EFs of 13 and 17 for the top 1% ligands (all decoys involved) respectively. The results for other fractions of top-ranked ligands are presented in S2 Fig. Such results further validate the AGIMA-Score models.
In each scenario, the enrichment factors (EFs) regarding various decoy-to-active ratios (rDTA) were calculated for each model, and plotted in a line. The black dashed lines indicate EF = 1.
Discussion on model interpretability
Interpretations of deep learning models can build confidence in their predictions, and have therefore attracted more and more attention in recent years. Here, we discuss ways to interpret the AGIMA-Score models at the model level and through post-hoc analysis.
Model-level interpretation.
Due to the black-box nature of deep learning models, explaining the intrinsic structures of these models, which often concern millions or even more parameters, is quite difficult. In this regard, we focus mainly on the learning architecture (Fig 3) of the AGIMA-Score models. This framework first transforms the original node features into an embedding space, and then considers the multi-range, distance-dependent inter-molecular interactions (the two nominated distance ranges) between a protein and its ligand. These imply important local interaction patterns between the two binding molecules. After a further feature-embedding transformation, the framework gathers the information from those interaction patterns (by concatenating the aggregated features of the two interaction types) in the binding area. Then it maps the gathered information into the components of molecular binding strength or interaction energy using another hidden layer, leading to the final prediction of the total binding strength. Hence, the framework can be partly explained from the perspective of molecular interaction energies.
Post-hoc interpretation.
After a model is constructed, investigating the roles of different features in the decision-making process and monitoring the correlations between some hidden features and the outputs are well-acknowledged strategies for decoding the model in a post-hoc way. Specifically, we employed the masking-based feature importance assessment and principal component analysis (PCA) of key feature embeddings in our work.
Masking-based feature importance assessment. To simplify the scenario, the goal here is to ascertain the importance of each node feature in the decision-making process of a given AGIMA-Score model. For such a model MDL, the assessment procedure is as follows.
- Run MDL on predicting the binding strengths of all the complexes in the validation set, obtaining the baseline results PC0 and RMSE0.
- Mask one node feature (the i-th feature) at a time and re-run MDL on the scoring task. Here, masking a feature means replacing its original values with 0s. For each masked feature, this yields a drop in PC and an increase in RMSE relative to the baseline.
- Rank the node features by their PC drops (or RMSE increases); features whose masking causes a large PC drop (or RMSE increase) are more important in the decision-making process.
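The three steps above can be sketched as a short loop. The model interface and toy data below are hypothetical stand-ins, assuming only that the model maps a feature matrix to predicted binding strengths:

```python
import numpy as np

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def masking_importance(predict, X, y):
    """Rank node features by the performance drop caused by zero-masking.

    predict: callable mapping a feature matrix (n_samples, n_features)
    to predicted binding strengths; X, y: validation features and labels.
    """
    base = predict(X)
    pc0 = pearson(base, y)                        # baseline PC0
    rmse0 = np.sqrt(np.mean((base - y) ** 2))     # baseline RMSE0
    scores = []
    for i in range(X.shape[1]):
        Xm = X.copy()
        Xm[:, i] = 0.0                            # mask the i-th feature
        pred = predict(Xm)
        pc_drop = pc0 - pearson(pred, y)
        rmse_inc = np.sqrt(np.mean((pred - y) ** 2)) - rmse0
        scores.append((i, pc_drop, rmse_inc))
    # A larger PC drop (or RMSE increase) marks a more important feature.
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Toy check with a linear stand-in model that relies mostly on feature 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w = np.array([2.0, 0.1, 0.0])
ranking = masking_importance(lambda M: M @ w, X, X @ w)
print(ranking[0][0])  # feature 0 ranks as most important
```

For a graph model such as AGIMA-Score, `X` would instead be the node-feature matrix of each complex, with the masked column zeroed across all atoms before rebuilding the graph inputs.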
The assessment result for the AGIMA-Score18 model is displayed in Fig 8. As shown there, certain pharmacophoric features (e.g. hydrophobicity, hybridization type and ring membership) carry more weight than the atom types (e.g. Carbon, Nitrogen and Oxygen) from the perspective of either PC drop or RMSE increase. This verifies the important role of certain pharmacophoric properties in determining protein-ligand binding, as frequently exploited in pharmacophore-based virtual screening [34]. However, other pharmacophoric properties, like hydrogen-bond donors, are of low importance in this scenario. The AGIMA-Score8 model depends on a feature set that combines pharmacophoric properties, atomic charges and excluded volume (Fig 9). In this scenario, the excluded volume and atomic charges (positive or negative) stand out from the crowd of pharmacophoric features. The AGIMA-Score13 model employs a simplified version of the feature set from AGIMA-Score18 (Fig 10). Here the partial charge, heavy-atom neighbors and hetero-atom neighbors dominate the PC drop, while atom types are more important in terms of RMSE increase. Summarizing the importance plots, atom features such as certain pharmacophoric properties, atomic charges and connectivity play a vital role in revealing protein-ligand binding.
The result was revealed by the masking-based performance drop on the validation set (PDBbind Core Set).
The result was revealed by the masking-based performance drop on the validation set (PDBbind Core Set).
The result was revealed by the masking-based performance drop on the validation set (PDBbind Core Set).
PCA of key feature embeddings. The two feature embeddings in the last-but-two layer in Fig 3 were monitored in this study, because these hidden features represent the important molecular interactions learned by an AGIMA-Score model; the two embeddings correspond to the inter-molecular interactions in the two distance ranges, respectively. PCA was adopted to compress these embeddings and to explore their correlations with the molecular binding strength. For better visualization, the first principal component (PC1) of each embedding was extracted for all the complexes in each set. Examining the correlations between such a PC and the binding strength of a complex can provide useful insights into the logic of AGIMA-Score models. Taking AGIMA-Score8 as an example, the PC1 vs. binding strength plots for the Training, Validation, Test1 and Test2 sets are shown in Fig 11. The linear trend of PC1 vs. binding strength for each of the two embeddings was also captured in this figure, where a marked difference between the two trendlines can be observed. This demonstrates that two different types of interactions are involved in determining the protein-ligand binding strength.
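This analysis can be reproduced along the following lines with scikit-learn. The embedding matrices here are synthetic placeholders standing in for the real hidden features of a trained model:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Synthetic stand-ins: one embedding matrix per distance range,
# shape (n_complexes, embedding_dim), plus measured binding strengths.
n, d = 300, 16
strength = rng.uniform(2.0, 12.0, size=n)
emb_a = np.outer(strength, rng.normal(size=d)) + rng.normal(size=(n, d))
emb_b = np.outer(-strength, rng.normal(size=d)) + rng.normal(size=(n, d))

def pc1_of(emb):
    """First principal component of an embedding matrix, one value
    per complex."""
    return PCA(n_components=1).fit_transform(emb).ravel()

pc1_a, pc1_b = pc1_of(emb_a), pc1_of(emb_b)
for name, pc1 in [("range-1 embedding", pc1_a), ("range-2 embedding", pc1_b)]:
    # Linear trend of PC1 vs. binding strength, as in the figures.
    fit = LinearRegression().fit(pc1.reshape(-1, 1), strength)
    r = np.corrcoef(pc1, strength)[0, 1]
    print(f"{name}: trendline slope={fit.coef_[0]:+.3f}, |r|={abs(r):.3f}")
```

Because the sign of a principal component is arbitrary, only the magnitude of the correlation (and the difference between the two trendlines) is meaningful, which matches how the trends are read in Figs 11 and 12.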
The two feature embeddings in the last-but-two layer of the model architecture were decoded by principal component analysis, and their first principal components were correlated with the binding strength via linear regression.
Focusing on the Validation set, the scatter plots of the two PC1s for this model are displayed in Fig 12, where multiple thresholds for binding strength are also set to reveal the trends. They show that higher values of the interactions (represented by the two PC1s) normally lead to higher binding strengths. A similar analysis for AGIMA-Score13 can be found in S3 and S4 Figs.
The PC1 vs. PC1 plots of the two feature embeddings for the validation set are shown. Different thresholds of binding strength were used to uncover the correlations between the PCs and the binding strength.
Conclusion
The AGIMA-Score framework is introduced in this work. It describes a protein-ligand binding structure as an atomic-level graph, with only the inter-molecular interactions taken into consideration. With its high computational efficiency, this framework places an absolute focus on learning the binding area of a protein-ligand complex. Based on different sets of node features and a neat graph-learning architecture, a number of AGIMA-Score models were constructed. Such models perform well in scoring protein-ligand binding strengths and in screening binders from non-binders for a target protein. Finally, they can be explained reasonably at the model level or in a post-hoc analysis. In the near future, our research will focus on exploring enriched sets of node features and developing more comprehensive approaches to model interpretability.
Supporting information
S1 Fig. The screening performance of each model on the EGFR set, with the top ranked ligands considered.
In each scenario, the enrichment factors (EFs) at various decoy-to-active ratios (rDTA) were calculated for each model and plotted as a line. The black dashed lines indicate EF = 1.
https://doi.org/10.1371/journal.pcbi.1013074.s001
(EPS)
S2 Fig. The screening performance of AGIMA-Score and competing models on the HIVPR, ADA17 and SRC sets, with the top ranked ligands considered.
In each scenario, the enrichment factors (EFs) at various decoy-to-active ratios (rDTA) were calculated for each model and plotted as a line. The black dashed lines indicate EF = 1.
https://doi.org/10.1371/journal.pcbi.1013074.s002
(EPS)
S3 Fig. Investigation of key feature embeddings in the AGIMA-Score13 model.
The two feature embeddings in the last-but-two layer of the model architecture were decoded by principal component analysis, and their first principal components were correlated with the binding strength via linear regression.
https://doi.org/10.1371/journal.pcbi.1013074.s003
(EPS)
S4 Fig. Principal component plots of feature embeddings in the AGIMA-Score13 model.
The PC1 vs. PC1 plots of the two feature embeddings for the validation set are shown. Different thresholds of binding strength were used to uncover the correlations between the PCs and the binding strength.
https://doi.org/10.1371/journal.pcbi.1013074.s004
(EPS)
References
- 1. Du-Harpur X, Watt FM, Luscombe NM, Lynch MD. What is AI? Applications of artificial intelligence to dermatology. Br J Dermatol. 2020;183(3):423–30. pmid:31960407
- 2. Johnson KB, Wei W-Q, Weeraratne D, Frisse ME, Misulis K, Rhee K, et al. Precision medicine, AI, and the future of personalized health care. Clin Transl Sci. 2021;14(1):86–93. pmid:32961010
- 3. Ivanenkov YA, Polykovskiy D, Bezrukov D, Zagribelnyy B, Aladinskiy V, Kamya P, et al. Chemistry42: an AI-driven platform for molecular design and optimization. J Chem Inf Model. 2023;63(3):695–701. pmid:36728505
- 4. Jayatunga MKP, Xie W, Ruder L, Schulze U, Meier C. AI in small-molecule drug discovery: a coming wave?. Nat Rev Drug Discov. 2022;21(3):175–6. pmid:35132242
- 5. Wang DD, Chan M-T. Protein-ligand binding affinity prediction based on profiles of intermolecular contacts. Comput Struct Biotechnol J. 2022;20:1088–96. pmid:35317230
- 6. Wang DD, Xie H, Yan H. Proteo-chemometrics interaction fingerprints of protein-ligand complexes predict binding affinity. Bioinformatics. 2021;37(17):2570–9. pmid:33650636
- 7. Wang DD, Zhu M, Yan H. Computationally predicting binding affinity in protein-ligand complexes: free energy-based simulations and machine learning-based scoring functions. Brief Bioinform. 2021;22(3):bbaa107. pmid:32591817
- 8. Ballester PJ, Mitchell JBO. A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics. 2010;26(9):1169–75. pmid:20236947
- 9. Liu Q, Kwoh CK, Li J. Binding affinity prediction for protein-ligand complexes based on β contacts and B factor. J Chem Inf Model. 2013;53(11):3076–85. pmid:24191692
- 10. Zilian D, Sotriffer CA. SFCscore(RF): a random forest-based scoring function for improved affinity prediction of protein-ligand complexes. J Chem Inf Model. 2013;53(8):1923–33. pmid:23705795
- 11. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–9. pmid:34265844
- 12. Brown T, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, et al. Language models are few-shot learners. In: Advances in Neural Information Processing Systems. 2020. p. 1877–901.
- 13. Jiménez J, Škalič M, Martínez-Rosell G, De Fabritiis G. KDEEP: protein-ligand absolute binding affinity prediction via 3D-convolutional neural networks. J Chem Inf Model. 2018;58(2):287–96. pmid:29309725
- 14. Stepniewska-Dziubinska MM, Zielenkiewicz P, Siedlecki P. Development and evaluation of a deep learning model for protein-ligand binding affinity prediction. Bioinformatics. 2018;34(21):3666–74. pmid:29757353
- 15. Rezaei MA, Li Y, Wu D, Li X, Li C. Deep learning in drug design: protein-ligand binding affinity prediction. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(1):407–17. pmid:33360998
- 16. Wang DD, Chan M-T, Yan H. Structure-based protein-ligand interaction fingerprints for binding affinity prediction. Comput Struct Biotechnol J. 2021;19:6291–300. pmid:34900139
- 17. Wang D, Wang R. Scoring protein-ligand complex structures by HybridNet. In: 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 2023. p. 4070–5.
- 18. Zheng L, Fan J, Mu Y. OnionNet: a multiple-layer intermolecular-contact-based convolutional neural network for protein-ligand binding affinity prediction. ACS Omega. 2019;4(14):15956–65. pmid:31592466
- 19. Wang DD, Wu W, Wang R. Structure-based, deep-learning models for protein-ligand binding affinity prediction. J Cheminform. 2024;16(1):2. pmid:38173000
- 20. Son J, Kim D. Development of a graph convolutional neural network model for efficient prediction of protein-ligand binding affinities. PLoS One. 2021;16(4):e0249404. pmid:33831016
- 21. Shen H, Zhang Y, Zheng C, Wang B, Chen P. A cascade graph convolutional network for predicting protein-ligand binding affinity. Int J Mol Sci. 2021;22(8):4023. pmid:33919681
- 22. Zhang X, Gao H, Wang H, Chen Z, Zhang Z, Chen X, et al. Planet: a multi-objective graph neural network model for protein–ligand binding affinity prediction. J Chem Inf Model. 2023.
- 23. Wang K, Zhou R, Tang J, Li M. GraphscoreDTA: optimized graph neural network for protein-ligand binding affinity prediction. Bioinformatics. 2023;39(6):btad340. pmid:37225408
- 24. Feinberg EN, Sur D, Wu Z, Husic BE, Mai H, Li Y, et al. PotentialNet for molecular property prediction. ACS Cent Sci. 2018;4(11):1520–30. pmid:30555904
- 25. Gilmer J, Schoenholz S, Riley P, Vinyals O, Dahl G. Neural message passing for quantum chemistry. In: International Conference on Machine Learning. 2017. p. 1263–72.
- 26. Kipf T, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint 2016. https://arxiv.org/abs/1609.02907
- 27. Gomes J, Ramsundar B, Feinberg E, Pande V. Atomic convolutional networks for predicting protein-ligand binding affinity. arXiv preprint 2017. https://arxiv.org/abs/1703.10603
- 28. Su M, Yang Q, Du Y, Feng G, Liu Z, Li Y, et al. Comparative assessment of scoring functions: the CASF-2016 update. J Chem Inf Model. 2019;59(2):895–913. pmid:30481020
- 29. Liu Z, Li Y, Han L, Li J, Liu J, Zhao Z, et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics. 2015;31(3):405–12. pmid:25301850
- 30. Dunbar JB Jr, Smith RD, Damm-Ganamet KL, Ahmed A, Esposito EX, Delproposto J, et al. CSAR data set release 2012: ligands, affinities, complexes, and docking decoys. J Chem Inf Model. 2013;53(8):1842–52. pmid:23617227
- 31. Jeffrey GA. An introduction to hydrogen bonding. New York: Oxford University Press; 1997.
- 32. Huey R, Morris G, Forli S. Using AutoDock 4 and AutoDock Vina with AutoDockTools: a tutorial. Scripps Research Institute Molecular Graphics Lab; 2012.
- 33. Friesner RA, Banks JL, Murphy RB, Halgren TA, Klicic JJ, Mainz DT, et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem. 2004;47(7):1739–49. pmid:15027865
- 34. Horvath D. Pharmacophore-based virtual screening. In: Chemoinformatics and Computational Chemical Biology. 2011. p. 261–98.