Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

doi:10.1371/journal.pcbi.1005929

Fig 1.

An illustration of the topology based machine learning algorithms used in scoring and virtual screening.

Table 1.

Pearson correlation coefficients (RMSE in kcal/mol) of the ligand-based topological model on the S1322 dataset.

Table 2.

Description of the PDBBind datasets.

Table 3.

Pearson correlation coefficients (RMSE in kcal/mol) of different protein-ligand complex based approaches on PDBBind datasets.

Table 4.

Parameters used in machine learning.

Table 5.

Performance on each protein in the DUD dataset.

Table 6.

AUC comparison of different methods on the DUD dataset.

Fig 2.

Statistics of ligands in the 7 protein clusters of the S1322 dataset.

The average number of heavy atoms per ligand in each protein cluster is shown in red, and the standard deviation of the number of heavy atoms within each protein cluster is shown in blue. The number of ligands in each cluster is given in parentheses.

Fig 3.

An illustration of similarities between ligands measured by their barcode space Wasserstein distances.

Ligands are ordered according to their binding affinities and are represented as dots on the semicircle. Specifically, a sample with binding free energy x is plotted at the angle θ = π(E_max − x)/(E_max − E_min), where E_min and E_max are the lowest and the highest energies in the dataset. Each dot is connected to its two nearest neighbors under the barcode space Wasserstein distance. An optimal prediction would be achieved if the lines stayed close to the semicircle. The majority of the connections stay near the boundary of the upper half disk, demonstrating that the barcode space Wasserstein distance reflects similarity in function, i.e., the binding affinity in this case. The protein clusters with the best and the worst performance are shown. Left: Protein cluster 2. Right: Protein cluster 3.
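The angular placement and the neighbor connections described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `energy_to_angle` and `two_nearest_neighbors` are hypothetical, the energies are toy values, and the distance matrix is filled with made-up numbers standing in for precomputed barcode space Wasserstein distances.

```python
import math

def energy_to_angle(x, e_min, e_max):
    """Map a binding free energy x to an angle on the semicircle:
    theta = pi * (E_max - x) / (E_max - E_min)."""
    return math.pi * (e_max - x) / (e_max - e_min)

def two_nearest_neighbors(dist, i):
    """Indices of the two samples closest to sample i, given a
    precomputed symmetric distance matrix (list of lists)."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:2]

# Hypothetical toy data: four ligands with energies in kcal/mol.
energies = [-12.0, -9.0, -6.0, -3.0]
e_min, e_max = min(energies), max(energies)
angles = [energy_to_angle(x, e_min, e_max) for x in energies]

# Positions of the dots on the unit semicircle.
points = [(math.cos(t), math.sin(t)) for t in angles]

# Made-up pairwise distances standing in for Wasserstein distances.
dist = [
    [0.0, 1.0, 2.5, 4.0],
    [1.0, 0.0, 1.2, 3.0],
    [2.5, 1.2, 0.0, 1.1],
    [4.0, 3.0, 1.1, 0.0],
]

# Each dot is linked to its two nearest neighbors.
edges = {i: two_nearest_neighbors(dist, i) for i in range(len(dist))}
```

With this mapping the lowest energy lands at angle π and the highest at angle 0, so ligands with similar affinities sit next to each other on the arc. Links between adjacent dots therefore hug the semicircle, while links between ligands of very different affinity cut across the interior.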

Table 7.

Experiments for ligand-based protein-ligand binding affinity prediction of 7 protein clusters and 1322 protein-ligand complexes.

Table 8.

Performance of different approaches on the S1322 dataset.

Fig 4.

Plot of performance against number of element combinations used.

Performance of the topological learning model against the number of element combinations involved in feature construction for the 7 protein clusters in S1322. The horizontal axis corresponds to the number of element combinations used for the features. From left to right, one extra element combination is added at each step. The features are then used in the gradient boosting trees method to test whether the model is robust against redundant information. Results for the alpha complex are marked in red and those for the Rips complex in blue. The reported value is the median Pearson correlation coefficient between predicted and experimental results over 20 repetitions of 10-fold cross-validation within each protein cluster.

Table 9.

Experiments for protein-ligand-complex-based protein-ligand binding affinity prediction for the PDBBind datasets.

Fig 5.

Feature robustness tests on PDBBind datasets.

The performance of the topological learning model against the number of included element combinations, for predicting on the PDBBind core sets after training on the PDBBind refined sets minus the core sets. The 1st and 2nd dimensional barcodes computed with the alpha complex are used. Features are generated following the barcode statistics method. Element combinations are all possible pairings of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in the protein with one item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in the ligand, which results in 8 × 16 = 128 element combinations. The horizontal straight lines represent the performance of the 2D representation with a deep convolutional neural network (row 10 in Table 10). The blue and red colors correspond to the Pearson correlation coefficient and the RMSE (kcal/mol), respectively. Each experiment is done by training on the refined set minus the core set, with the median result of 20 repeated runs reported.
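The count of 128 element combinations follows directly from pairing the two lists in the caption. A short sketch of that enumeration is given below; the variable names are illustrative and are not taken from the authors' code.

```python
from itertools import product

# Element groups listed in the caption.
protein_groups = ["C", "N", "O", "CN", "CO", "NO", "CNO", "CNOS"]
ligand_groups = [
    "C", "N", "O", "S", "CN", "CO", "CS", "NO", "NS", "OS",
    "CNO", "CNS", "COS", "NOS", "CNOS", "CNOSPFClBrI",
]

# Every pairing of one protein group with one ligand group.
element_combinations = list(product(protein_groups, ligand_groups))

print(len(element_combinations))  # 8 * 16 = 128
```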

Table 10.

Performance of different protein-ligand complex based approaches on the PDBBind datasets.

Fig 6.

Assessment of performance of the model on samples with elements that are rare in the data sets.

For the four data sets PDBBind v2007, v2013, v2015, and v2016 [99], and for each element, the testing set is the subset of the original core sets with only ligands that contain atoms of the particular element type. The features used are features with ID = 7 in Table 10. The reported RMSE is the average taken over the four data sets. Experiment 1: Training set is the original training set and all the features are used. Experiment 2: Training set is the original training set and only features that do not involve the particular element are used. Experiment 3: Training set is the original training set excluding the samples that contain atoms of the particular element type and all features are used. For most of the elements, experiment 1 achieves the best result and experiment 3 yields the worst performance.
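The three experimental setups differ only in which training samples and which features are kept. The sketch below shows one way to express that selection. It is an illustration under assumptions: the function `build_experiments`, the sample layout (dicts with a `ligand_elements` set), the feature names, and the toy data are all hypothetical and do not come from the paper.

```python
def build_experiments(train, core, feature_elements, element):
    """Return a dict mapping experiment number (1, 2, 3) to a tuple
    (training samples, test samples, feature ids).

    train, core: lists of dicts with a "ligand_elements" set (assumed layout).
    feature_elements: dict mapping feature id -> set of elements it involves.
    """
    # Test set: core-set samples whose ligand contains the element.
    test = [s for s in core if element in s["ligand_elements"]]
    all_features = sorted(feature_elements)
    # Experiment 2 keeps only features that do not involve the element.
    reduced = [f for f in all_features if element not in feature_elements[f]]
    # Experiment 3 drops training samples whose ligand contains the element.
    train_without = [s for s in train if element not in s["ligand_elements"]]
    return {
        1: (train, test, all_features),
        2: (train, test, reduced),
        3: (train_without, test, all_features),
    }

# Hypothetical toy data.
train = [
    {"id": "a", "ligand_elements": {"C", "N", "O"}},
    {"id": "b", "ligand_elements": {"C", "O", "Br"}},
    {"id": "c", "ligand_elements": {"C", "N", "Br"}},
]
core = [
    {"id": "x", "ligand_elements": {"C", "Br"}},
    {"id": "y", "ligand_elements": {"C", "O"}},
]
feature_elements = {"C-C": {"C"}, "C-Br": {"C", "Br"}, "N-O": {"N", "O"}}

experiments = build_experiments(train, core, feature_elements, "Br")
```

In this toy case the Br test set contains only sample `x`, experiment 2 drops the `C-Br` feature, and experiment 3 trains on sample `a` alone.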

Fig 7.

Heat map plot of the 16 channels.

The mean value (left image) and the standard deviation (right image) of each digit over the PDBBind v2016 refined set are shown. The top 8 maps are for the protein-ligand complex and the other 8 maps are for the difference between the protein-ligand complex and the protein only. For each map, the vertical axis is the element combinations ordered according to their importance and the horizontal axis is the dimension of spatial scales.

Fig 8.

Multi-level persistent homology on simple small molecules.

Illustration of representation ability of in reflecting structural perturbations among conformations of the same molecule. Left: The structural alignment of two conformations of the ligand in protein-ligand complex (PDB:1BCD). Right: The persistence diagram showing the 1st and 2nd dimensional results generated using Rips complex with for two conformations. It is worth noticing that the barcodes generated using Rips complex with M are identical for the two conformations.

Fig 9.

Illustration of dividing set of barcodes into subsets.

The barcodes are plotted as persistence diagrams with the horizontal axis being birth and the vertical axis being death. From left to right, the subsets are generated according to the slicing of death, birth, and persistence values.
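The slicing described here amounts to binning each bar by its death, birth, or persistence value. A minimal sketch follows, assuming each bar is a (birth, death) pair and that bins are half-open intervals. The function `slice_barcodes`, the bin edges, and the toy bars are hypothetical; the caption does not state the bins actually used.

```python
def slice_barcodes(bars, key, edges):
    """Group (birth, death) bars into bins [edges[k], edges[k+1]).

    key is one of "birth", "death", "persistence".
    Bars whose value falls outside all bins are dropped.
    """
    value = {
        "birth": lambda b, d: b,
        "death": lambda b, d: d,
        "persistence": lambda b, d: d - b,
    }[key]
    subsets = [[] for _ in range(len(edges) - 1)]
    for b, d in bars:
        v = value(b, d)
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                subsets[k].append((b, d))
                break
    return subsets

# Hypothetical toy barcode and bin edges.
bars = [(0.0, 1.5), (0.5, 3.0), (2.0, 2.5), (1.0, 5.5)]
edges = [0.0, 2.0, 4.0, 6.0]

by_death = slice_barcodes(bars, "death", edges)
by_birth = slice_barcodes(bars, "birth", edges)
by_persistence = slice_barcodes(bars, "persistence", edges)
```

On a persistence diagram these three choices correspond to horizontal strips (death), vertical strips (birth), and strips parallel to the diagonal (persistence).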

Fig 10.

The 2D topological maps of the 16 channels of sample 1wkm.

The top 8 maps are for the protein-ligand complex and the other 8 maps are for the difference between the protein-ligand complex and the protein only. For each map, the horizontal axis is the dimension of spatial scale and the vertical axis is the element combinations ordered by their importance.

Fig 11.

The network architecture of TopBP-DL.

The structured layers are shown in boxes/rectangles with sharp corners for 2D/1D image-like content and the unstructured layers are shown in rectangles. The numbers in the convolution layers give, from left to right, the number of filters and the filter size. The dense layers are drawn with the number of neurons and the activation function. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with a “×n” sign on the right side of the layer.

Fig 12.

The network architecture of TopVS-DL.

The 1D image-like layers are shown in sharp-corner rectangles. The numbers in the convolution layers give, from left to right, the number of filters and the filter size. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with a “×n” sign on the right side of the layer.
