Fig 1.
In level A, a protein's amino acid sequence and a small molecule's SMILES string are transformed into input tokens. The protein tokens are converted to embedding vectors using the trained ESM-1b model, while the SMILES tokens are mapped to embedding vectors using the trained ChemBERTa2 model. In level B, all tokens are mapped to the same embedding space and are used as the input sequence for a Transformer Network. In level C, the Transformer Network processes the input tokens and outputs an updated embedding of a classification token (cls), which incorporates information from both the protein and the small molecule. In level D, this cls vector, in combination with the ESM-1b and ChemBERTa2 vectors, serves as the input for gradient boosting models trained to predict protein-small molecule interactions.
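The four levels of this pipeline can be sketched in a few lines of purely illustrative NumPy. The dimensions, the random "embeddings", the single-head attention, and all weight matrices below are stand-ins, not the authors' implementation; the point is only how the two modalities are projected into one space and aggregated through a cls token:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (D_PROT matches ESM-1b's 1280; the others are arbitrary)
D_PROT, D_MOL, D_SHARED = 1280, 384, 64

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (stand-in for the Transformer Network)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V

# Level A: token embeddings from the pretrained models (random stand-ins here)
prot_tokens = rng.normal(size=(50, D_PROT))   # ESM-1b output, one row per residue token
mol_tokens  = rng.normal(size=(20, D_MOL))    # ChemBERTa2 output, one row per SMILES token

# Level B: map both modalities into the same embedding space and prepend a cls token
W_prot = rng.normal(size=(D_PROT, D_SHARED)) * 0.02
W_mol  = rng.normal(size=(D_MOL, D_SHARED)) * 0.02
cls    = rng.normal(size=(1, D_SHARED))
seq = np.vstack([cls, prot_tokens @ W_prot, mol_tokens @ W_mol])

# Level C: attention updates every token; the cls row now mixes both modalities
Wq, Wk, Wv = (rng.normal(size=(D_SHARED, D_SHARED)) * 0.02 for _ in range(3))
updated = self_attention(seq, Wq, Wk, Wv)
cls_updated = updated[0]

# Level D: cls_updated would feed the downstream gradient boosting models
print(cls_updated.shape)  # (64,)
```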
Table 1.
Performance metrics for ProSmith and previously published methods for DTA prediction on the random split of the Davis dataset.
Bold numbers highlight the best performance for each metric. Numbers in brackets indicate the standard deviation across the 5 repeated training runs with different splits. Numbers after each method name show the year of publication. Performance scores, except for the results of ProSmith, are taken from Ref. [12]. Arrows next to the metric names indicate whether higher (↑) or lower (↓) values correspond to better model performance.
Table 2.
Performance metrics for ProSmith and previously published methods for DTA prediction for different splitting scenarios.
Bold numbers highlight the best performance for each metric under each scenario. Numbers in brackets indicate the standard deviation across the 5 repeated training runs with different splits. Arrows next to the metric names indicate whether higher (↑) or lower (↓) values correspond to better model performance. Performance scores, except for the results of ProSmith, are taken from Ref. [12].
Fig 2.
For accurate DTA predictions, ProSmith requires training on identical drugs but not on similar proteins.
(a) We separately analyzed model performance for dataset splits where drugs from the test set occur in the training set a specified number of times (0, 1, 3, 10, 30, 100, and > 300), calculating the coefficient of determination R2 for each of these test sets separately. (b) We divided all five randomly created test sets under the cold target splitting scenario into subsets with different levels of protein sequence identity compared to proteins in the training set, calculating the coefficient of determination R2 for each subset separately. Numbers above the plotted points indicate the number of test data points in each category.
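The per-subset evaluation in panel (a) amounts to binning test points by drug occurrence count and computing R2 within each bin. A minimal sketch with synthetic data (all values and bin labels below are made up for illustration, not taken from the paper):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
y_true = rng.normal(size=200)                      # synthetic affinity labels
y_pred = y_true + rng.normal(scale=0.5, size=200)  # synthetic model predictions
counts = rng.choice([0, 1, 3, 10], size=200)       # times each test drug occurs in training

# R^2 for each occurrence bin separately
for c in [0, 1, 3, 10]:
    mask = counts == c
    if mask.any():
        print(c, round(r_squared(y_true[mask], y_pred[mask]), 3))
```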
Table 3.
Performance metrics for ProSmith and ESP for the prediction of enzyme-substrate pairs.
Bold numbers highlight the best performance for each metric. Arrows next to the metric names (↑) indicate that higher values correspond to better model performance.
Fig 3.
ProSmith outperforms the ESP model in the prediction of enzyme-substrate pairs, especially for molecules with limited representation in the training data.
(a) We grouped small molecules from the test set by how often they occur as substrates among all positive data points in the training set, calculating the MCC for each group separately. (b) We divided the test set into subsets with different levels of maximal enzyme sequence identity compared to enzymes in the training set, calculating the MCC for each group separately. The numbers of data points within each subset of panel (a) are listed in S4 Table and for panel (b) in S5 Table.
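The grouped evaluation here uses the Matthews correlation coefficient (MCC) within each subset. A self-contained sketch on synthetic binary data (the identity bins and accuracy level are invented for illustration):

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=300)                            # synthetic pair labels
y_pred = np.where(rng.random(300) < 0.8, y_true, 1 - y_true)     # ~80% accurate predictions
identity_bin = rng.choice(["<40%", "40-80%", ">80%"], size=300)  # max sequence identity bin

# MCC for each sequence-identity subset separately
for b in ["<40%", "40-80%", ">80%"]:
    m = identity_bin == b
    print(b, round(mcc(y_true[m], y_pred[m]), 3))
```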
Table 4.
Performance metrics of ProSmith and previously published methods for the prediction of Michaelis constants KM.
Metrics were calculated using the same training and test data for all three models. Bold numbers highlight the best performance for each metric. Arrows next to the metric names indicate whether higher (↑) or lower (↓) values correspond to better model performance.
Fig 4.
The optimal ProSmith models combine predictions based on the multimodal Transformer Network with predictions based on separate numerical representations of proteins and small molecules.
The bar plots quantify the weights assigned to the predictions of the three distinct gradient boosting models contributing to ProSmith: the model trained only on the cls token from ProSmith’s multimodal Transformer Network (teal); the model combining ESM-1b and ChemBERTa2 vectors (blue); and the model combining all three input vectors (grey). The weights are displayed separately for the distinct prediction tasks: drug-target affinity (DTA) (four different splits); enzyme-substrate pairs; Michaelis constants KM. Numbers in square brackets show the number of training data points.
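Combining the three gradient boosting models with task-specific weights can be sketched as a convex blend whose weights are selected on validation data. The grid search and all data below are illustrative stand-ins, not the procedure reported in the paper:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
y_val = rng.normal(size=100)  # synthetic validation labels

# Hypothetical validation predictions from the three gradient boosting models
preds = np.stack([
    y_val + rng.normal(scale=0.6, size=100),  # cls-token model
    y_val + rng.normal(scale=0.4, size=100),  # ESM-1b + ChemBERTa2 model
    y_val + rng.normal(scale=0.5, size=100),  # model using all three input vectors
])

# Search non-negative weights summing to 1 that minimize validation MSE
best_w, best_mse = None, np.inf
grid = np.arange(0.0, 1.01, 0.05)
for w1, w2 in product(grid, grid):
    w3 = 1.0 - w1 - w2
    if w3 < -1e-9:
        continue
    blend = w1 * preds[0] + w2 * preds[1] + max(w3, 0.0) * preds[2]
    mse = np.mean((y_val - blend) ** 2)
    if mse < best_mse:
        best_w, best_mse = (w1, w2, max(w3, 0.0)), mse

print(best_w, round(best_mse, 3))
```

Because the pure single-model weightings lie on the grid, the blended model can never do worse on the validation set than the best individual model.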