
Machine learning modeling of family wide enzyme-substrate specificity screens

Fig 5

Structure-based pooling improves enzyme activity predictions.

(A) Different pooling strategies can be used to combine amino acid representations from a pretrained protein language model. Yellow coloring in the schematic indicates residues that will be averaged to derive a representation of the protein of interest. (i) We introduce active site pooling, where only embeddings corresponding to residues within a set radius of the protein active site are averaged. Increasing the angstrom radius around the active site increases the number of residues pooled. Crystal structures shown are taken from the BKACE reference structure (PDB: 2Y7F), rendered with Chimera [60]. (ii, iii) We also introduce two other alignment-based pooling strategies: coverage and conservation pooling average only the top-k alignment columns with the fewest gaps and the highest number of conserved residues, respectively. (iv) Current protein embeddings often take a mean pooling strategy, indiscriminately averaging over all sequence positions. (B) Enzyme discovery AUPRC values are computed for the various pooling strategies. Each strategy is tested at different thresholds for the number of residues pooled, comparing against both KNN Levenshtein distance baselines and a mean pooling baseline. The same hyperparameters are used as in Fig 2 for the ridge regression models. The kinase repurposing regression task from Hie et al. is reported with Spearman's ρ instead of AUPRC because interactions are continuous, not binarized. All experiments are repeated for 3 random seeds.
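The pooling strategies in panel A can be sketched in a few lines of NumPy. The function names and the choice of C-alpha coordinates as the distance reference are illustrative assumptions, not the authors' implementation; the inputs are assumed to be a per-residue embedding matrix, residue coordinates, and per-column gap counts from a multiple sequence alignment.

```python
import numpy as np

def mean_pool(embeddings):
    # (iv) Mean pooling: indiscriminately average over all sequence positions.
    # embeddings: (n_residues, embed_dim)
    return embeddings.mean(axis=0)

def active_site_pool(embeddings, ca_coords, active_site_coord, radius):
    # (i) Active site pooling (sketch): average only embeddings for residues
    # whose C-alpha coordinate lies within `radius` angstroms of the active
    # site. Using C-alpha atoms here is an assumption for illustration.
    dists = np.linalg.norm(ca_coords - active_site_coord, axis=1)
    mask = dists <= radius
    return embeddings[mask].mean(axis=0)

def coverage_pool(embeddings, column_gap_counts, k):
    # (ii) Coverage pooling (sketch): average the top-k alignment columns
    # with the fewest gaps. column_gap_counts[i] = gaps in alignment column i,
    # assumed already mapped onto this sequence's residue positions.
    top_k = np.argsort(column_gap_counts)[:k]
    return embeddings[top_k].mean(axis=0)
```

Conservation pooling (iii) follows the same pattern as `coverage_pool`, ranking columns by conserved-residue counts (descending) instead of gap counts.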

doi: https://doi.org/10.1371/journal.pcbi.1009853.g005