Fig 1.
Scoring a protein-ligand complex structure to estimate its binding strength.
Fig 2.
Different definitions of atomic adjacency in a molecular graph.
A. Covalent adjacency. B. Distance-dependent contacts. C. A combination of covalent adjacency and distance-dependent contacts. D. Inter-molecular contacts through distance thresholding.
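As an illustration of panel D, the following is a minimal sketch of how inter-molecular contacts could be derived by distance thresholding. The function name, the toy coordinates and the 4.0 cutoff are hypothetical and are not taken from the paper.

```python
import numpy as np

def intermolecular_adjacency(protein_xyz, ligand_xyz, cutoff):
    """Binary adjacency over all atoms of the complex (protein atoms first).

    Only protein-ligand atom pairs closer than `cutoff` are connected;
    protein-protein and ligand-ligand entries stay zero.
    """
    n_p, n_l = len(protein_xyz), len(ligand_xyz)
    # Pairwise Euclidean distances between protein and ligand atoms.
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    contacts = (dist <= cutoff).astype(np.float32)

    adj = np.zeros((n_p + n_l, n_p + n_l), dtype=np.float32)
    adj[:n_p, n_p:] = contacts      # protein -> ligand block
    adj[n_p:, :n_p] = contacts.T    # ligand -> protein block (symmetric)
    return adj

# Toy example: two protein atoms and one ligand atom (hypothetical coordinates).
protein = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[3.0, 0.0, 0.0]])
adj = intermolecular_adjacency(protein, ligand, cutoff=4.0)
```

Calling the function with several cutoffs and stacking the results would give one adjacency matrix per distance range.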
Fig 3.
A lightweight graph-learning architecture adopted in this work.
The node feature matrix F and inter-molecular adjacency tensor A of a protein-ligand complex are the inputs, and the binding strength is the output. Main components of this architecture include graph convolution layers, node aggregation layers, dense (fully-connected) layers and dropout layers.
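The data flow described in this caption can be sketched as a small NumPy forward pass. This is only an illustration under assumptions: the layer sizes, the sum aggregation, the single dense layer and all parameter names are hypothetical, and the actual AGIMA-Score layers may differ.

```python
import numpy as np

def graph_conv(F, A, W):
    # One propagation step: sum neighbour features, apply a linear map, then ReLU.
    return np.maximum(A @ F @ W, 0.0)

def forward(F, A_stack, params, drop_rate=0.0, rng=None):
    """F: (n_atoms, n_features); A_stack: (k, n_atoms, n_atoms)."""
    # One graph convolution per adjacency matrix, concatenated per node.
    H = np.concatenate(
        [graph_conv(F, A, W) for A, W in zip(A_stack, params["conv"])], axis=1
    )
    g = H.sum(axis=0)  # node aggregation into one graph-level vector
    if drop_rate > 0.0 and rng is not None:
        # Inverted dropout, applied only when a random generator is supplied.
        keep = rng.random(g.shape) >= drop_rate
        g = g * keep / (1.0 - drop_rate)
    # Dense layer mapping the graph vector to a scalar binding strength.
    return float(g @ params["dense_w"] + params["dense_b"])

# Toy inputs: 5 atoms, 8 node features, 2 adjacency matrices (all hypothetical).
rng = np.random.default_rng(0)
F = rng.normal(size=(5, 8))
A_stack = (rng.random((2, 5, 5)) > 0.5).astype(float)
params = {
    "conv": [rng.normal(size=(8, 4)) for _ in range(2)],
    "dense_w": rng.normal(size=8),
    "dense_b": 0.0,
}
score = forward(F, A_stack, params)
```

Because the node aggregation is a sum, the predicted score does not depend on the order in which atoms are listed.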
Table 1.
Node-feature sets for building molecular graphs.
Three feature sets, with 18 features (from Pafnucy), 8 features (from KDEEP) and 13 features (from GraphBAR) respectively, were considered in this study. The names and data types of these features are listed.
Fig 4.
Similarity test for each pair of complexes (Training vs. Validation, Training vs. Test1, and Training vs. Test2).
The horizontal axis shows the similarity between the two protein sequences in a complex pair, and the vertical axis shows the similarity between the two ligands. The red dotted line marks a sequence similarity of 0.3 and the yellow line marks a ligand similarity of 0.7.
Table 2.
Scoring performance comparison.
The models were trained on the PDBbind Refined Set (version V2020), with parameters tuned on the Core Set (version V2020), and tested on two sets from the CSAR source. State-of-the-art deep learning models (ACNN, OnionNet, KDEEP and GraphBAR) for scoring protein-ligand complexes were implemented to comprehensively evaluate the proposed AGIMA-Score models. For GraphBAR, different graph adjacency schemes (2 or 3 adjacency matrices) were adopted for model construction. For AGIMA-Score, different node features (from Pafnucy, KDEEP and GraphBAR, respectively) and adjacency schemes (2 adjacency matrices or a single adjacency matrix) were considered. By default, 2 adjacency matrices (generated by intermolecular atomic contacts within two distance ranges) were adopted in the graph learning by AGIMA-Score. The best performance in terms of PC and RMSE is underlined for the state-of-the-art methods and for the proposed AGIMA-Score models.
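For reference, the two metrics reported in this table can be computed as in the sketch below. It assumes that PC denotes the Pearson correlation coefficient; the function names and the toy values are illustrative only and are not results from the paper.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    # Pearson correlation between experimental and predicted binding strengths.
    return float(np.corrcoef(np.asarray(y_true, float), np.asarray(y_pred, float))[0, 1])

def rmse(y_true, y_pred):
    # Root-mean-square error between experimental and predicted binding strengths.
    err = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy values: predictions offset from the truth by a constant of 1.
y_true = np.array([4.0, 6.0, 8.0])
y_pred = np.array([5.0, 7.0, 9.0])
pc = pearson_r(y_true, y_pred)
err = rmse(y_true, y_pred)
```

The toy case shows why both metrics are reported: a constant offset leaves the correlation perfect while the RMSE still registers the error.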
Fig 5.
The task starts from a target protein and a large library of ligands, followed by modeling each protein-ligand binding structure (docking tool) and scoring the binding structures (scoring model). The highly ranked ligands (according to the predicted scores) are regarded as potential binders for further biochemical experiments.
Fig 6.
The screening performance of each model on the EGFR set, with the top ranked ligands considered.
In each scenario, the enrichment factors (EFs) at various decoy-to-active ratios (rDTA) were calculated for each model and plotted as a line. The black dashed lines indicate EF = 1.
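The sketch below uses a common definition of the enrichment factor: the fraction of actives among the top-ranked ligands divided by the fraction of actives in the whole library. It assumes that a higher score means stronger predicted binding; the function name, the rounding of the top fraction and the toy data are hypothetical, and the paper's exact definition may differ.

```python
import numpy as np

def enrichment_factor(scores, is_active, top_fraction):
    scores = np.asarray(scores, float)
    is_active = np.asarray(is_active, bool)
    # Number of ligands kept in the top-ranked subset (at least one).
    n_top = max(1, int(round(top_fraction * len(scores))))
    # Rank by descending score and take the top subset.
    top_idx = np.argsort(-scores)[:n_top]
    hit_rate_top = is_active[top_idx].mean()
    hit_rate_all = is_active.mean()
    return float(hit_rate_top / hit_rate_all)

# Toy library: 2 actives and 8 decoys (decoy-to-active ratio of 4).
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.5]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef = enrichment_factor(scores, labels, top_fraction=0.2)
```

Under this definition a random ranking gives an EF of about 1 on average, which is the baseline marked by the dashed line.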
Fig 7.
The screening performances of AGIMA-Score and competing models on the HIVPR, ADA17 and SRC sets, with the top ranked ligands considered.
In each scenario, the enrichment factors (EFs) at various decoy-to-active ratios (rDTA) were calculated for each model and plotted as a line. The black dashed lines indicate EF = 1.
Fig 8.
Importance assessment of the node features involved in the AGIMA-Score18 model.
Feature importance was measured as the performance drop caused by masking each feature on the validation set (PDBbind Core Set).
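A masking-based importance test of this kind can be sketched as follows. The sketch assumes that masking means setting one node-feature column to zero and that the performance drop is measured as the increase in RMSE; both are assumptions, and `predict` is a hypothetical stand-in for a trained model.

```python
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def masking_importance(predict, F_list, y_true):
    """Increase in RMSE when each node-feature column is zeroed in turn."""
    base = rmse(y_true, [predict(F) for F in F_list])
    drops = []
    for j in range(F_list[0].shape[1]):
        masked_preds = []
        for F in F_list:
            Fm = F.copy()
            Fm[:, j] = 0.0  # mask feature j for every atom
            masked_preds.append(predict(Fm))
        drops.append(rmse(y_true, masked_preds) - base)
    return np.array(drops)

# Toy model that only uses feature 0, so feature 1 should show no drop.
rng = np.random.default_rng(0)
F_list = [rng.normal(size=(6, 2)) + 1.0 for _ in range(10)]
predict = lambda F: float(F[:, 0].sum())
y_true = np.array([predict(F) for F in F_list])
drops = masking_importance(predict, F_list, y_true)
```

A larger drop indicates that the model relies more heavily on the masked feature.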
Fig 9.
Importance assessment of the node features involved in the AGIMA-Score8 model.
Feature importance was measured as the performance drop caused by masking each feature on the validation set (PDBbind Core Set).
Fig 10.
Importance assessment of the node features involved in the AGIMA-Score13 model.
Feature importance was measured as the performance drop caused by masking each feature on the validation set (PDBbind Core Set).
Fig 11.
Investigation of key feature embeddings in the AGIMA-Score8 model.
The feature embeddings in the last-but-two layer of the model architecture were decomposed by principal component analysis, and their first principal components were correlated with the binding strength via linear regression.
Fig 12.
Principal component plots of feature embeddings in the AGIMA-Score8 model.
Plots of the first principal components (PC1) of the feature embeddings for the validation set are shown. Different thresholds of binding strength were used to uncover the correlations between the PCs and the binding strength.