Machine learning modeling of family wide enzyme-substrate specificity screens

doi:10.1371/journal.pcbi.1009853

Machine learning modeling of family wide enzyme-substrate specificity screens

Fig 2

Assessing enzyme discovery in family wide screens.

(A) CPI models are compared against the single task setting by holding out enzymes for a given substrate and allowing models to train on either the full expanded data (CPI) or only data specific to that substrate (single-task). (B) AUPRC is compared on five different datasets, arranged from left to right in order of increasing number of enzymes in the dataset. Baseline models are compared against multi-task models, CPI models, and single-task models. K-nearest neighbor (KNN) baselines are calculated using Levenshtein edit distances to compare sequences; multi-task models use a shared feed forward network (FFN) to compute predictions against all substrate targets, CPI models utilize FFN with either concatenation (“[{prot repr.}, {sub repr.}]”) or dot product interactions (“{prot repr.}•{sub repr.}”), and ridge regression is used for single-task models. ESM-1b features indicate protein features extracted from a masked language model trained on UniRef50 [20]. Halogenase and glycosyltransferase datasets are evaluated using leave-one-out splits, whereas BKACE, phosphatase, and esterase datasets are evaluated with 5 repeats of 10 different cross validation splits. Standard error bars indicate the standard error of the mean of results computed with 3 random seeds. Each method is compared to the single-task L2-regularized logistic regression model (“Ridge: ESM-1b”) using a 2-sided Welch T test, with each additional asterisk representing significance at [0.05, 0.01, 0.001, 0.0001] thresholds respectively after application of a Benjamini-Hochberg correction. (C) Average AUPRC on each individual “substrate task” is compared between compound protein interaction models and single-task models. Points below 1 indicate substrates on which single-task models better predict enzyme activity than CPI models. CPI models used are FFN: [ESM-1b, Morgan] and single-task models are Ridge: ESM-1b. (D) AUPRC values from the ridge regression model are plotted against the average enzyme similarity in a dataset, with higher enzyme similarity revealing better predictive performance. (E) AUPRC values from the ridge regression model broken out by each task are plotted against the fraction of active enzymes in the dataset. Best fit lines are drawn through each dataset to serve as a visual guide.

doi: https://doi.org/10.1371/journal.pcbi.1009853.g002