
Machine learning modeling of family wide enzyme-substrate specificity screens

Fig 3

Assessing substrate discovery in family wide screens.

Compound–protein interaction (CPI) models and single-task models are compared on the glycosyltransferase, esterase, and phosphatase datasets, each with 5 trials of 10-fold cross validation. Error bars represent the standard error of the mean across 3 random seeds. Each model and featurization is compared to “Ridge: Morgan” using a two-sided Welch's t-test, with each additional asterisk representing significance at the 0.05, 0.01, 0.001, and 0.0001 thresholds, respectively, after applying a Benjamini–Hochberg correction. The pretrained substrate featurization used in “Ridge: JT-VAE” consists of features extracted from a junction-tree variational autoencoder (JT-VAE) [53]. Two CPI architectures are tested, concatenation and dot-product, indicated with “[{prot repr.}, {sub repr.}]” and “{prot repr.}•{sub repr.}”, respectively. In the interaction-based architectures, ESM-1b indicates the use of a masked language model trained on UniRef50 as the protein representation [20]. Models are hyperparameter-optimized on a held-out halogenase dataset. AUC-ROC results can be found in Fig D in S1 Text.
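As a sketch of the “Ridge: Morgan” single-task baseline, the snippet below fits a ridge model on 2048-bit Morgan fingerprints for one enzyme's substrate panel. The SMILES strings, labels, and the choice of ridge regression over a classifier are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import Ridge

# Hypothetical substrates and binary activity labels for one enzyme.
smiles = ["CCO", "CC(=O)O", "c1ccccc1O", "CC(C)CO"]
y = np.array([1, 0, 1, 0])

# 2048-bit Morgan fingerprints (radius 2): the "Morgan" featurization in the figure.
def morgan(smi: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)
    return np.array(list(fp), dtype=np.uint8)

X = np.stack([morgan(s) for s in smiles])

# Ridge regression on the binary labels; ranking its continuous
# predictions yields AUC-style metrics for substrate discovery.
model = Ridge(alpha=1.0).fit(X, y)
scores = model.predict(X)
```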
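The significance annotation can be reproduced roughly as follows: a two-sided Welch's t-test of each model against the “Ridge: Morgan” baseline, a Benjamini–Hochberg correction across comparisons, and one asterisk per threshold cleared. The per-seed scores here are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical per-seed evaluation scores for the baseline and two other models.
scores = {
    "Ridge: Morgan":         np.array([0.41, 0.43, 0.40]),
    "Ridge: JT-VAE":         np.array([0.38, 0.37, 0.39]),
    "FFN: [ESM-1b, Morgan]": np.array([0.47, 0.49, 0.48]),
}

baseline = scores["Ridge: Morgan"]
names = [n for n in scores if n != "Ridge: Morgan"]

# Two-sided Welch's t-test (unequal variances) against the baseline.
pvals = [ttest_ind(scores[n], baseline, equal_var=False).pvalue for n in names]

# Benjamini-Hochberg correction across all comparisons.
_, pvals_adj, _, _ = multipletests(pvals, method="fdr_bh")

# Map adjusted p-values to asterisks at the thresholds from the caption.
thresholds = [0.05, 0.01, 0.001, 0.0001]
for name, p in zip(names, pvals_adj):
    stars = "*" * sum(p < t for t in thresholds)
    print(f"{name}: adjusted p = {p:.4f} {stars}")
```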
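A minimal sketch of the two CPI architectures, assuming PyTorch and generic embedding sizes (the hidden widths and projection dimensions are illustrative, not the optimized hyperparameters): concatenation feeds “[{prot repr.}, {sub repr.}]” through a feed-forward scorer, while the dot-product variant projects both representations into a shared space before computing “{prot repr.}•{sub repr.}”.

```python
import torch
import torch.nn as nn

class ConcatCPI(nn.Module):
    """'[{prot repr.}, {sub repr.}]': concatenate embeddings, then score with an MLP."""
    def __init__(self, prot_dim: int, sub_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(prot_dim + sub_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, prot: torch.Tensor, sub: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([prot, sub], dim=-1)).squeeze(-1)

class DotProductCPI(nn.Module):
    """'{prot repr.}•{sub repr.}': project both into a shared space, take a dot product."""
    def __init__(self, prot_dim: int, sub_dim: int, shared: int = 128):
        super().__init__()
        self.prot_proj = nn.Linear(prot_dim, shared)
        self.sub_proj = nn.Linear(sub_dim, shared)

    def forward(self, prot: torch.Tensor, sub: torch.Tensor) -> torch.Tensor:
        return (self.prot_proj(prot) * self.sub_proj(sub)).sum(dim=-1)

# E.g., a mean-pooled ESM-1b protein embedding (1280-d) with a 2048-bit Morgan fingerprint.
logits = DotProductCPI(1280, 2048)(torch.randn(4, 1280), torch.randn(4, 2048))
```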


doi: https://doi.org/10.1371/journal.pcbi.1009853.g003