Fig 1.

Enzyme-substrate interaction modeling strategies.

(A) Current machine learning-directed evolution strategies, which involve design-build-test-model-learn cycles measuring protein variant activity on a single substrate of interest. (B) The “dense screen” setting, where homologous enzyme variants from one protein family are profiled against multiple substrates. In this setting, we can aim to generalize to either new enzymes (“enzyme discovery”) or new substrates (“substrate discovery”). (C) Three different styles of models evaluated in this study: single-task models independently build predictive models for the rows and columns of panel (B), whereas a compound-protein interaction (CPI) model takes both substrates and enzymes as input. (D) An example CPI model architecture, in which pretrained neural networks extract features from the substrate and enzyme that are fed into a top-level feed-forward model for activity prediction.
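
To make panel (D) concrete, the following is a minimal sketch of a concatenation-style CPI model, assuming precomputed fixed-length representations (e.g., a 1280-dimensional ESM-1b embedding for the enzyme and a 2048-bit Morgan fingerprint for the substrate); the layer sizes and names are illustrative, not the study's exact architecture.

```python
import torch
import torch.nn as nn

class ConcatCPI(nn.Module):
    """Concatenation-style CPI model: [{prot repr.}, {sub repr.}] -> FFN."""

    def __init__(self, prot_dim=1280, sub_dim=2048, hidden_dim=256):
        super().__init__()
        # Top-level feed-forward model that scores an (enzyme, substrate) pair.
        self.ffn = nn.Sequential(
            nn.Linear(prot_dim + sub_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prot_repr, sub_repr):
        # Concatenate the two representations before the shared FFN.
        pair = torch.cat([prot_repr, sub_repr], dim=-1)
        return self.ffn(pair).squeeze(-1)  # one activity logit per pair

# Example: a batch of 8 hypothetical enzyme-substrate pairs.
model = ConcatCPI()
logits = model(torch.randn(8, 1280), torch.randn(8, 2048))
```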

Table 1.

Summary of curated datasets, listing the number of unique enzymes, unique substrates, and unique enzyme-substrate pairs in each dataset, along with an exemplar structure for each protein family.

Fig 2.

Assessing enzyme discovery in family-wide screens.

(A) CPI models are compared against the single-task setting by holding out enzymes for a given substrate and allowing models to train on either the full expanded data (CPI) or only data specific to that substrate (single-task). (B) AUPRC is compared on five different datasets, arranged from left to right in order of increasing number of enzymes in the dataset. Baseline models are compared against multi-task models, CPI models, and single-task models. K-nearest neighbor (KNN) baselines are calculated using Levenshtein edit distances to compare sequences; multi-task models use a shared feed-forward network (FFN) to compute predictions against all substrate targets; CPI models use an FFN with either concatenation (“[{prot repr.}, {sub repr.}]”) or dot-product interactions (“{prot repr.}•{sub repr.}”); and ridge regression is used for single-task models. ESM-1b features indicate protein features extracted from a masked language model trained on UniRef50 [20]. Halogenase and glycosyltransferase datasets are evaluated using leave-one-out splits, whereas BKACE, phosphatase, and esterase datasets are evaluated with 5 repeats of 10 different cross-validation splits. Error bars indicate the standard error of the mean across 3 random seeds. Each method is compared to the single-task L2-regularized logistic regression model (“Ridge: ESM-1b”) using a 2-sided Welch t-test, with each additional asterisk representing significance at the [0.05, 0.01, 0.001, 0.0001] thresholds, respectively, after application of a Benjamini-Hochberg correction. (C) Average AUPRC on each individual “substrate task” is compared between CPI models and single-task models. Points below 1 indicate substrates on which single-task models predict enzyme activity better than CPI models. The CPI model used is FFN: [ESM-1b, Morgan] and the single-task model is Ridge: ESM-1b. (D) AUPRC values from the ridge regression model are plotted against the average enzyme similarity in a dataset, with higher enzyme similarity corresponding to better predictive performance. (E) AUPRC values from the ridge regression model, broken out by task, are plotted against the fraction of active enzymes in the dataset. Best-fit lines are drawn through each dataset to serve as a visual guide.
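
As one example of the statistical annotation described above, the sketch below computes significance stars for each method relative to a reference model, assuming per-seed AUPRC values have been collected per method; the function and variable names are illustrative.

```python
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def significance_stars(method_scores, reference_scores):
    """method_scores: dict of method name -> list of per-seed AUPRC values."""
    names = list(method_scores)
    # 2-sided Welch t-test (unequal variances) against the reference model.
    pvals = [
        ttest_ind(method_scores[m], reference_scores, equal_var=False).pvalue
        for m in names
    ]
    # Benjamini-Hochberg correction across all comparisons.
    _, corrected, _, _ = multipletests(pvals, method="fdr_bh")
    # One asterisk per threshold passed: [0.05, 0.01, 0.001, 0.0001].
    thresholds = [0.05, 0.01, 0.001, 0.0001]
    return {m: "*" * sum(p < t for t in thresholds)
            for m, p in zip(names, corrected)}
```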

Fig 3.

Assessing substrate discovery in family-wide screens.

CPI models and single-task models are compared on the glycosyltransferase, esterase, and phosphatase datasets, all with 5 trials of 10-fold cross-validation. Error bars represent the standard error of the mean across 3 random seeds. Each model and featurization is compared to “Ridge: Morgan” using a 2-sided Welch t-test, with each additional asterisk representing significance at the [0.05, 0.01, 0.001, 0.0001] thresholds, respectively, after applying a Benjamini-Hochberg correction. Pretrained substrate featurizations used in “Ridge: JT-VAE” are features extracted from a junction-tree variational autoencoder (JT-VAE) [53]. Two CPI architectures are tested, concatenation and dot product, indicated with “[{prot repr.}, {sub repr.}]” and “{prot repr.}•{sub repr.}”, respectively. In the interaction-based architectures, ESM-1b indicates the use of a masked language model trained on UniRef50 as the protein representation [20]. Hyperparameters are optimized on a held-out halogenase dataset. AUROC results can be found in Fig D in S1 Text.
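
For reference, here is a minimal sketch of a “Ridge: Morgan”-style single-task baseline for substrate discovery, assuming one enzyme's binary activity profile over substrates given as SMILES strings; the example molecules and model settings are illustrative, not the study's exact configuration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import RidgeClassifier

def morgan_features(smiles_list, radius=2, n_bits=2048):
    """Featurize substrates as Morgan fingerprint bit vectors."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(fp))
    return np.stack(fps)

# Hypothetical data: three training substrates with binary activity labels.
X_train = morgan_features(["CCO", "CC(=O)O", "c1ccccc1O"])
y_train = np.array([1, 0, 1])
model = RidgeClassifier().fit(X_train, y_train)
# Score a held-out substrate not seen during training.
scores = model.decision_function(morgan_features(["CCCO"]))
```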

Fig 4.

Evaluating single-task models on kinase repurposing and discovery tasks.

Kinase data from Davis et al. are extracted, featurized, and split as prepared in Hie et al. Multilayer perceptron (MLP) and Gaussian process + multilayer perceptron (GP+MLP) models are employed. We add variants of these models without CPI, training separate single-task models for each enzyme and substrate in the training set, as well as linear models using both pretrained featurizations (“Ridge: JT-VAE”) and fingerprint-based featurizations of small molecules (“Ridge: Morgan”). Spearman correlation is shown for (A) held-out kinases not in the training set and (B) held-out small molecules not in the training set, across 5 random initializations. (C) We repeat the retrospective evaluation of lead prioritization. The average acquired Kd of the top 5 selections is shown for the CPI models from Hie et al. compared against a linear, single-task ridge regression model using the same features. (D) The average acquired Kd of the top 25 selections is shown.
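
A minimal sketch of the evaluation in panels (A)-(D), assuming arrays of predicted and measured Kd values for held-out pairs; the data below are placeholders.

```python
import numpy as np
from scipy.stats import spearmanr

def top_k_acquired_kd(predicted_kd, measured_kd, k):
    """Mean measured Kd of the k compounds ranked most promising
    (lowest predicted Kd), as in the lead-prioritization panels."""
    order = np.argsort(predicted_kd)    # tightest predicted binders first
    return measured_kd[order[:k]].mean()

predicted = np.array([12.0, 3500.0, 40.0, 900.0, 7.5, 210.0])
measured = np.array([25.0, 5000.0, 18.0, 1200.0, 10.0, 300.0])

rho, _ = spearmanr(predicted, measured)             # panels (A, B): rank correlation
top5 = top_k_acquired_kd(predicted, measured, k=5)  # panel (C)-style metric
```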

Fig 5.

Structure-based pooling improves enzyme activity predictions.

(A) Different pooling strategies can be used to combine amino acid representations from a pretrained protein language model. Yellow coloring in the schematic indicates residues that will be averaged to derive a representation of the protein of interest. (i) We introduce active site pooling, where only embeddings corresponding to residues within a set radius of the protein active site are averaged; increasing the angstrom radius from the active site increases the number of residues pooled. Crystal structures shown are taken from the BKACE reference structure, PDB ID 2Y7F, rendered with Chimera [60]. (ii, iii) We also introduce two other alignment-based pooling strategies: coverage pooling and conservation pooling average only the top-k alignment columns with the fewest gaps and the most conserved residues, respectively. (iv) Current protein embeddings often take a mean-pooling strategy, indiscriminately averaging over all sequence positions. (B) Enzyme discovery AUPRC values are computed for the various pooling strategies. Each strategy is tested at different thresholds for the number of residues pooled, comparing against both KNN Levenshtein-distance baselines and a mean-pooling baseline. Ridge regression models use the same hyperparameters as in Fig 2. The kinase repurposing regression task from Hie et al. is shown with Spearman’s ρ instead of AUPRC, as interactions are continuous, not binarized. All experiments are repeated for 3 random seeds.
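
The following is a minimal sketch of active-site pooling (panel A, i), assuming per-residue embeddings from a pretrained language model, C-alpha coordinates from a reference structure, and a known active-site coordinate; all inputs below are illustrative placeholders rather than the study's exact pipeline.

```python
import numpy as np

def active_site_pool(residue_embeddings, ca_coords, site_xyz, radius):
    """Average only embeddings of residues whose C-alpha lies within
    `radius` angstroms of the active site; widening the radius pools
    more residues, approaching mean pooling (panel A, iv) in the limit."""
    dists = np.linalg.norm(ca_coords - site_xyz, axis=1)
    mask = dists <= radius
    if not mask.any():  # fall back to mean pooling if no residue qualifies
        return residue_embeddings.mean(axis=0)
    return residue_embeddings[mask].mean(axis=0)

# Toy example: 120 residues with 1280-d embeddings and 3-d coordinates.
emb = np.random.randn(120, 1280)
coords = np.random.randn(120, 3) * 20.0
pooled = active_site_pool(emb, coords, site_xyz=coords[10], radius=8.0)
```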
