Positional SHAP (PoSHAP) for Interpretation of machine learning models trained from biological sequences

doi:10.1371/journal.pcbi.1009736

Fig 1.

Overview of data, modeling, and positional SHAP analysis for model interpretation.

Peptide sequences and output data were downloaded from Haj et al. 2020, Hu et al. 2019, and Meier et al. 2021 and used as input for three separate deep learning models. The peptide sequences were numerically encoded and split into positional inputs, and Long Short-Term Memory (LSTM) models were trained to predict each of the outputs. These outputs included the five peptide array intensities for the Mamu MHC allele data, IC50 binding data for the human MHC A*11:01 data, and collisional cross section (CCS) for the mass spectrometry data. The trained models were then used to make predictions on a separate test subset of each dataset. Finally, the model interpretation method SHAP was adapted to determine the contribution of each amino acid position to the final prediction. This PoSHAP analysis was visualized by plotting the mean SHAP value of each amino acid at each position as a heatmap.
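
The encoding step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the integer vocabulary, the reserved index 0 for the "End" padding token, and the names `VOCAB` and `encode_peptide` are assumptions made for this example.

```python
# Hypothetical vocabulary: index 0 is reserved for the "End" padding token
# (an assumption), and the 20 standard amino acids take indices 1..20.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
END = "End"
VOCAB = {END: 0, **{aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}}


def encode_peptide(seq, max_len=10):
    """Return one integer per position, padding short peptides with End."""
    if len(seq) > max_len:
        raise ValueError(f"peptide longer than {max_len} residues: {seq}")
    tokens = list(seq) + [END] * (max_len - len(seq))
    return [VOCAB[token] for token in tokens]


# An 8-residue peptide fills positions 1-8; positions 9 and 10 hold End.
encoded = encode_peptide("ACDEFGHI")
print(encoded)
```

Keeping one input per position is what later allows a SHAP value to be attributed to a position rather than to the peptide as a whole.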

Fig 2.

LSTM Model Performance.

Held-out test peptides were input to the models and predictions were plotted against true experimental values. (A) For the Mamu allele multi-output regression model, predicted and experimental intensities were compared. (B) For the A*11:01 model, predicted and experimental IC50s were compared. (C) For the collisional cross section model, predicted and experimental collisional cross sections were compared. For each model, predicted and experimental values were compared using Spearman’s rank correlation, and all models showed a significant (p-value < 1E-145) positive correlation (rho > 0.6).
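
For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation of the ranks of the two variables. The sketch below uses only the standard library and does not compute a p-value; the function names are illustrative and are not taken from the paper. In practice, a library routine such as `scipy.stats.spearmanr` would typically be used.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman_rho(predicted, experimental):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(predicted), average_ranks(experimental))
```

Because only ranks matter, any monotonic relationship between predicted and experimental values gives rho = 1, even if the relationship is not linear.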

Fig 3.

Heatmaps showing PoSHAP analysis to determine amino acid binding motifs from deep learning models.

The mean SHAP values for each amino acid at each position across all peptides in the test set were arranged into a heatmap. The position in each peptide is along the y-axis and the amino acid is given along the x-axis. “End” is used in positions 9 and 10 to enable inputs of peptides with length 8, 9, or 10. For comparison, the SHAP force plot for the peptide with the highest binding prediction is shown below each allele.
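
The heatmap values described above can be sketched as a group-by over (position, amino acid) pairs. The sketch assumes that per-position SHAP values have already been computed for each test peptide; the function name `poshap_means` and the toy numbers are hypothetical.

```python
from collections import defaultdict


def poshap_means(peptides, shap_values):
    """Mean SHAP value for every (position, amino acid) pair.

    peptides:    list of token lists, already padded with "End"
    shap_values: list of per-position SHAP values, one row per peptide
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for tokens, row in zip(peptides, shap_values):
        for position, (token, value) in enumerate(zip(tokens, row)):
            sums[(position, token)] += value
            counts[(position, token)] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy data: two peptides of padded length 3 (values are made up).
peptides = [["A", "K", "End"], ["A", "R", "L"]]
shap_values = [[0.2, -0.1, 0.0], [0.4, 0.3, 0.1]]
means = poshap_means(peptides, shap_values)
```

Each key of the returned dictionary corresponds to one heatmap cell; pairs that never occur in the test set are simply absent.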

Fig 4.

PoSHAP interpretation of models trained to predict A*11:01 binding or CCS.

The mean SHAP values for each amino acid across all test peptides were calculated and arranged into heatmaps for (A) A*11:01 and (B) CCS. The position along the peptide is shown on the y-axis and each amino acid is listed along the x-axis.

Fig 5.

Amino acid summary statistics differ from PoSHAP values for the CCS data.

(A) Amino acid counts as a function of position for the training data. (B) Procedure for picking the ‘top peptides’ with the highest CCS. Linear regression was performed on the peptides ranked by their actual CCS value. Peptides that fell above both the trendline and the overall mean were defined as ‘top peptides’. (C) Counts of amino acids for the top peptides were summarized in a heatmap. (D) Mean SHAP values across amino acids and positions from PoSHAP analysis.
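
One possible reading of the selection procedure in (B) is sketched below. It assumes the regression is an ordinary least-squares fit of CCS against rank (0, 1, 2, ... after sorting by CCS); the caption does not state this explicitly, and the function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def top_peptides(ccs_by_peptide):
    """Peptides whose CCS lies above both the trendline and the mean.

    Assumption: the trendline is fit to CCS as a function of rank.
    Requires at least two peptides.
    """
    ranked = sorted(ccs_by_peptide.items(), key=lambda item: item[1])
    ranks = list(range(len(ranked)))
    values = [value for _, value in ranked]
    slope, intercept = fit_line(ranks, values)
    overall_mean = sum(values) / len(values)
    return [
        peptide
        for rank, (peptide, value) in zip(ranks, ranked)
        if value > slope * rank + intercept and value > overall_mean
    ]
```

With toy values 1, 2, 3, 4, and 20, the fitted line is CCS = 4 * rank - 2 and the mean is 6, so only the peptide with value 20 is selected.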

Fig 6.

SHAP dependence plots for allele Mamu-A001 show how relationships between sequential amino acids contribute to binding.

Each graph represents a pair of positions in the peptide, here (A) positions one and two and (B) positions two and three. The x-axis lists each possible amino acid for that position and the y-axis shows the SHAP value. Each point represents a peptide with the listed amino acid at that position on the x-axis, and the amino acid in the subsequent position is shown by color. This shows how the range of SHAP values for a particular amino acid at a specific position reflects its dependence on the amino acids at other positions.
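
The data behind such a dependence plot can be sketched as a grouping of SHAP values by the amino acid at the plotted position and the amino acid at the partner position. The function name `dependence_groups`, the zero-based positions, and the toy values are assumptions for illustration only.

```python
from collections import defaultdict


def dependence_groups(peptides, shap_values, position, partner_position):
    """Group SHAP values at `position` by (amino acid there, partner amino acid).

    Positions are zero-based here (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for tokens, row in zip(peptides, shap_values):
        key = (tokens[position], tokens[partner_position])
        groups[key].append(row[position])
    return dict(groups)


# Toy data: the SHAP value of "A" at the first position differs depending
# on whether the next residue is "K" or "D" (values are made up).
peptides = [["A", "K"], ["A", "D"], ["A", "K"]]
shap_values = [[0.5, 0.1], [-0.2, 0.0], [0.7, 0.2]]
groups = dependence_groups(peptides, shap_values, 0, 1)
```

A wide spread between groups that share the first amino acid but differ in the partner residue is what appears in the plot as color-separated bands.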

Fig 7.

Dependence analysis of CCS model.

Significant values (Bonferroni-corrected p-value < 0.05) were taken from the interpositional dependence analysis. For each compared position pair, the difference between the mean SHAP value of the interdependent amino acids and that of the remaining amino acids was grouped by (A) the distance between the interacting positions, (B) the category of interaction, or (C) both distance and interaction category. Categories are labelled as follows for the combined bar plot and heatmap: ζ<< = charge attraction, ζ<> = charge repulsion, * = other, and δ = polar. For the distance analysis, interactions were grouped into three categories: neighboring (distance = 1), near (distance = 2, 3, 4, 5, 6), and far (distance = 7, 8, 9). * indicates significance (ANOVA with Tukey’s post hoc test p-value < 0.05). For the interaction categories in (B) and (C), each interaction was grouped by the expected type of interaction between the two amino acids. Significant differences between interaction types are marked by connecting lines (ANOVA with Tukey’s post hoc test p-value < 0.05). (D) Significant differences between combined categories are illustrated by the heatmap, where significant values (ANOVA with Tukey’s post hoc test p-value < 0.05) are designated by colors other than purple. Exact p-values for each are provided in S4 Table.

Repulsive molecular interactions, including charge repulsion and “other” interactions (likely steric interactions or interactions between the termini), increased predicted CCS. Notably, there were very few significant hydrophobic interactions. This may reflect that hydrophobic interactions between amino acids in a peptide act to minimize contact with a polar solvent rather than acting as an attractive force themselves. Peptides lose polar solvent (water) during the electrospray process, which may prevent significant hydrophobic interactions; this might contradict prior work [64].
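
The distance grouping and the Bonferroni threshold described in the caption can be sketched as below. The direction of the mean difference (interdependent minus remaining) and all function names are assumptions of this sketch; the p-values themselves would come from whatever statistical test the dependence analysis uses, which is not reproduced here.

```python
def distance_category(position_a, position_b):
    """Map a position pair to the caption's three distance groups."""
    distance = abs(position_a - position_b)
    if distance == 1:
        return "neighboring"
    if 2 <= distance <= 6:
        return "near"
    if 7 <= distance <= 9:
        return "far"
    raise ValueError(f"distance {distance} is outside the 1-9 range")


def mean_difference(interdependent, remaining):
    """Mean SHAP of the interdependent group minus mean of the rest.

    The sign convention is an assumption of this sketch.
    """
    return (sum(interdependent) / len(interdependent)
            - sum(remaining) / len(remaining))


def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose Bonferroni-corrected p-value is below alpha."""
    n_tests = len(p_values)
    return [min(p * n_tests, 1.0) < alpha for p in p_values]
```

With three tests, a raw p-value of 0.02 becomes 0.06 after correction and is no longer significant at 0.05, which is why only strong interactions survive in an all-pairs analysis.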

Fig 8.

CCS PoSHAP of Various Machine Learning Models.

PoSHAP analysis was performed on two additional machine learning models, Extra Trees and Extreme Gradient Boosting (XGB). (A) Predictions were plotted against experimental values, and the mean squared error and r values are reported for each model. (B) PoSHAP heatmaps were created for each model and standardized by the highest value in each heatmap, illustrating an increase in model complexity as more sophisticated models are used. Dependence analysis was performed on each model and the significant interactions are plotted by (C) distance and by (D) combined distance and interaction type.
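
The per-heatmap standardization in (B) might look like the sketch below. It assumes that "highest value" means the largest absolute value, so that signs are preserved and every heatmap is scaled into [-1, 1]; the caption does not specify this, and the function name is hypothetical.

```python
def standardize_heatmap(heatmap):
    """Divide every cell by the largest absolute value in the heatmap.

    Assumption: scaling uses the maximum absolute value, not the raw maximum.
    An all-zero heatmap is returned unchanged (as a copy).
    """
    peak = max(abs(value) for row in heatmap for value in row)
    if peak == 0:
        return [row[:] for row in heatmap]
    return [[value / peak for value in row] for row in heatmap]


scaled = standardize_heatmap([[1.0, -2.0], [0.5, 4.0]])
print(scaled)
```

Scaling each model's heatmap separately makes the patterns comparable across models even when the raw SHAP magnitudes differ.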
