Ten quick tips for sequence-based prediction of protein properties using machine learning

Fig 7

Various ways to interpret your model.

(A) Simple checks of the (cor)relation between class labels and features of interest; the example shows how the pattern of conservation differs between surface (Sur), interface (Inter), and buried (Bur) residues for protein–protein interactions (top, PPI) and epitope interactions, a specific type of PPI (bottom); based on data from Hou and colleagues [37].

(B) Gini feature importance, a simple per-feature measure of how much each feature contributes to the predictions, shown here as a heat map (features as rows) for five different models for PPI and epitope prediction (columns) [20]; note how some features are prominent across models (bright red in every column), while others appear more model specific (red in only some columns).

(C) SHAP plot, an additive score that estimates, per data point (single points) and per feature (along the vertical axis), the effect of the feature value (color) on the prediction (horizontal axis) [17]; most features show many strong effects, both positive and negative. The top feature is sequence length, followed by amino acid type (AA); the remaining features relate to the PSSM profile per amino acid type and propensities for coil (PC) or helix (PA); for most, a window mean (WM) is taken.
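As a minimal sketch of how the Gini importances in panel (B) can be obtained, the snippet below trains a random forest with scikit-learn and reads off its mean-decrease-in-impurity (Gini) importances. The synthetic data and feature count are placeholders, not the features or models used in the figure.

```python
# Hypothetical sketch: Gini (mean decrease in impurity) feature importance
# with a random forest, on synthetic stand-in data for per-residue features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification problem standing in for, e.g.,
# interface vs. non-interface residues with 8 sequence-derived features.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per feature; the values are non-negative and sum to 1,
# so they can be compared across features (and stacked across models
# into a heat map as in the figure).
importances = clf.feature_importances_
for i in np.argsort(importances)[::-1]:
    print(f"feature_{i}: {importances[i]:.3f}")
```

Because impurity-based importances are computed on the training data and can favour high-cardinality features, permutation importance on held-out data is a common complementary check.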


doi: https://doi.org/10.1371/journal.pcbi.1010669.g007