Physics-based modeling provides predictive understanding of selectively promiscuous substrate binding by Hsp70 chaperones

To help cells cope with protein misfolding and aggregation, Hsp70 molecular chaperones selectively bind a variety of sequences (“selective promiscuity”). Statistical analyses from substrate-derived peptide arrays reveal that DnaK, the E. coli Hsp70, binds to sequences containing three to five branched hydrophobic residues, although otherwise the specific amino acids can vary considerably. Several high-resolution structures of the substrate -binding domain (SBD) of DnaK bound to peptides reveal a highly conserved configuration of the bound substrate and further suggest that the substrate-binding cleft consists of five largely independent sites for interaction with five consecutive substrate residues. Importantly, both substrate backbone orientations (N- to C- and C- to N-) allow essentially the same backbone hydrogen-bonding and side-chain interactions with the chaperone. In order to rationalize these observations, we performed atomistic molecular dynamics simulations to sample the interactions of all 20 amino acid side chains in each of the five sites of the chaperone in the context of the conserved substrate backbone configurations. The resulting interaction energetics provide the basis set for deriving a predictive model that we call Paladin (Physics-based model of DnaK-Substrate Binding). Trained using available peptide array data, Paladin can distinguish binders and nonbinders of DnaK with accuracy comparable to existing predictors and further predicts the detailed configuration of the bound sequence. Tested using existing DnaK-peptide structures, Paladin correctly predicted the binding register in 10 out of 13 substrate sequences that bind in the N- to C- orientation, and the binding orientation in 16 out of 22 sequences. The physical basis of the Paladin model provides insight into the origins of how Hsp70s bind substrates with a balance of selectivity and promiscuity. The approach described here can be extended to other Hsp70s where extensive peptide array data is not available.

Note that backbone hydrogen bonds involving M404, S427 and A429 completely overlap in two orientations." (Figure 2 caption) and "… reveals that the backbone conformations remain conserved and all five hydrogen bonding interactions between the substrate backbone and βSBD are formed in both orientations (thick and thin magenta dashed lines). Importantly, the highly conserved backbone conformation projects the side chains into identical binding pockets regardless of the orientation. For example, Figure 2B illustrates that the slight shift of Cb positions does not affect the placement of the central L into site 0. As such, …" (Page 8, lines 168-174) -In section "Binding site geometry and surface properties determine the DnaK-substrate selectivit", the authors mention predictions from LIMBO and Rudiger, but not ChaperISM.
Response: ChaperISM uses a position Independent Scoring Matrix (PISM), so doesn't contain information about the relative preferences for individual amino acids at different sites. There was a position-specific scoring matrix mentioned in the original publication, but the better-performing standard options are only for the PISM in (higher performance, default) "qualitative" mode and (slightly lower performance, not used in our paper) "quantitative mode".

-On Page 11, line 233, there is no graph or values showing correlation between vdW interaction energies and size of the side chain, only visual inspection of Figure 4 and S4.
Response: To alleviate the apparent need for such a figure, we have referred to the correlation as "rough" to try to make it clearer that the trend isn't critical to our argument (Page 12, line 318), but that the terms as calculated are consistent of what we'd expect. Given that the sidechains are in similar environments, it makes sense that as the number of atoms increases, the number of vdW contacts and thus vdW energy should roughly increase.
-On Page 12, line 235-237, the authors should comment on the possible impact of the length of the simulation (3ns) on the backbone strain term.
Response: Given the highly conserved backbone structures of the complex (e.g., see Figure 2), we do not anticipate large scale conformational re-arrangements in response to different substrate side chain groups. Analysis of the simulation trajectories (up to 50 ns) suggested that local structural relaxation occurred rapidly, and all energetics quickly stabilized within 3 ns (e.g., see Figure S16 and S17). The convergence is further discussed in Methods (see Page 23-24, lines 698-703).
-On Page 15, line 303-305, authors should give details on the data set balance. If the data set is imbalanced, it may be more interesting to show a precision-recall curve. In case the data set is balanced, it would be interesting to show how Paladin would perform in a imbalanced idependent data set (look CD data set at doi:10.1002/prot.25084). Despite this data set came from a different protein, the authors claim that their method can be expanded to other Hsp70 chaperones.

Response:
We thank the reviewer for the excellent suggestion. The current dataset is imbalanced, and we have described the distribution of the four sub-categories in the manuscript (Page 25, lines 773-775). We have also included a new SI Figure S9 to compare the precision-recall curves, which supports that Paladin is comparable to other methods in recapitulating the peptide array data. Additional discussion has been added to the revised manuscript on the new analysis (Page 16, lines 426-435 ). Our current model cannot be extended to other Hsp70s without resampling energy terms with the structure of the new Hsp70, which we respectfully argue is beyond the scope of our current study.
-The AUC ROC values for Paladin, in comparison with Rudiger and ChaperISM, performs worst. The authors have provided possible reasons why this is happening, saying "...the small differences in ROC curves of Paladin and previous models are likely insignificant.". However, none of these reasons were tested.

Response:
The criticism is well taken. Indeed, Paladin is the worst in reproducing the peptide array data but the difference is small. In particularly, The AUC PR values are similar for Paladin, ChaperISM and Limbo for all binders (Table S4, Figure S9) and Paladin yields the best AUC PR value (0.58) among all four methods evaluated for strong binders (0.37-0.52). In the revised manuscript, we delete the statement noted by the reviewer. Instead, we provide the following discussion: "On the other hand, the other models including Paladin perform comparably and the differences in ROC/PR curves are small. Importantly, apparent better fitting to the data do not necessarily reflect a superior model. Instead, by incorporating the physics of molecular interactions and using only a small number of free parameters (Table 1), the Paladin model is designed to be capable of predicting key molecular details of binding including backbone orientation and residue registry. This provides an opportunity to reveal the balance of various interactions from different sites in binding specificity and to be extended to modeling other Hsp70s without extensive peptide array data." (Page 16-17, lines 465-531).
-On Page 20, line 417-418, the authors should also mention how ChaperISM performs with Pro at site 0.
Response: Because ChaperISM is position-independent, this cannot be done. ChaperISM scores a P at any position the same. For reference, P has a negative score in ChaperISM (higher score means stronger binder) so P isn't predicted to be favorable at any site.
-The final PSSM matrices with energetic terms need to be included, as this is common practice for prediction methods.
Response: Thank you for pointing this out. We have now included the values of PSSM in a new SI Table S3. Response: This comment is well-taken, since we do mention the role of ATP/ADP in the functional cycle of DnaK and show the NBD without showing either the molecules themselves or at least the binding sites.
- Figure 2: The interpretation of image B is difficult. I'd recommend (i) different colors for each one of the orientation modes or (ii) two separate images, one with the forward orientation and another one with the reverse orientation.
Response: Thank you for these excellent suggestions to improve the figure. We have considered both of them previously while workshopping this figure extensively over the last year prior to submission. The problems with these suggestions are that: i) Different colors erase the identity of the substrate atoms, which is really crucial detail, and ii) separate images make it difficult to see the striking similarities and symmetric differences between both orientations which are evident in the overlap. Instead, we have now included a new Supplementary Movie S1 to more clearly show the conserved backbone and side chain interactions in forward and backward orientations.
- Figure 3: There is room for improvement in this image, specially in respect to the surfaces and the representation of the peptides (e.g. ribbon vs stick; colors; transparency).

Response:
We agree with the challenge of illustrating these shallow binding surfaces in static 2D figures. This again is a figure that we have revised extensively over time and there seems to be no perfect combination of angle, rendering and coloring. Instead, we have now included a new Supplementary Movie S2 to more clearly show the site 0 binding pocket.
- Figure 4: I suggest to increase the space in between the charts. Also, the legend is interfering with the data in some cases (e.g. Backbone Conformational Propensity). I suggest to move the legend to another place. Additionally, what FE stands for? This should be written in the legend.

Response:
We have expanded FE to Free Energy, fixed the transparency of the legends and adjusted the spacing slightly to minimize clashes.
- Figure S2: Label the residues in the figure. This is a important figure, however, the representation and labels are not adequate to show the relevant point.
- Figure S3: Label the residues in the figure.
Response: Both figures have been updated to include the labels.
- Figure S4: The graph for "Backbone Conformational Propensity" is the same in A-D. Is this correct? If yes, why is so different from site 0?
Response: Thanks for catching the error. The main text Figure 4 is correct, Figure S4 was a mistake we forgot to update during preparation.
- Figure S8: The legend of the figure do not explain what the star and the spheres represent. "Correct" and "other" may not be the better label, unless explained in the legend.
Response: Thanks, the figure caption has been clarified to explain the use of stars and circles.
- Figure S12: Label the residues in the figure.
- Figure S13: Please, correct y-axis titles appropriately (e.g. type of energy and unit).
- - Table S1: Include a footnote to explain "Z" amino acid corresponds to cyclohexyalanine.
Response: All figures and tables have been updated as suggested. We greatly appreciate the reviewer's thorough reading of both the main text and SI. -one section explaining how the comparison among methods (limbo, rudiger, chaperism, etc) was performed. Also, explain the distribution of the data on the used datasets (e.g. balanced vs imbalanced).

Summary
In this work, the authors develop a physics-based algorithm to predict binding sites of DnaK, the Hsp70 from E. coli. They posit their algorithm is unique in comparison to existing tools in that it is able to predict orientation (Forward versus reverse) as well as registry (i.e. the specific site occupied by each substrate residue in the DnaK substrate-binding-domain). In their first set of results, the authors perform a qualitative analysis of a set of previously determined DnaK structures by Zahn and colleagues, and discuss which residues are represented in each position in the substrate and what the underlying energetics are. They also cross-reference these data with LIMBO, a comparable DnaK-substrate prediction algorithm.

Next, the authors describe and validate their own prediction algorithm, Paladin. They show that their algorithm performs similarly to existing algorithms in terms of predicting DnaK-binding peptides in a dataset generated by Rüdiger et al. They further show that Paladin can predict registry of forward-binding peptides, and can predict whether a peptide will bind in the forward or the reverse orientation based on a relatively small set of DnaK-substrate structures.
The authors conclude that the algorithm they have designed is a good DnaK substrate predictor, and given the relatively small extent to which the algorithm was parameterized on the experimental data set, they predict the algorithm to be easily transferable to similar systems.

General
The physics-based algorithm described here adequately predicts DnaK-binding sites, and performs comparably to existing tools for the same purpose. Their use of a 5-residue binding site allows the authors to test for specific substrate registry, in which Paladin outperforms some of its competitors. The algorithm also predicts binding orientation reasonably well, although the authors do not compare the performance therein with other predictors. The explicit use of individual force terms allows for an assessment of which forces drive binding of certain peptides, which is not the case for predictors such as the one devised by Rüdiger, but is comparable to LIMBO. Another advantage of the algorithm seems to be its transferability to similar systems, although the authors should test this before claiming it. Furthermore, the authors provide an interesting analysis of how and why DnaK can interact with substrates in two opposite orientations.

Overall, the authors describe a tool that satisfactorily predicts DnaK binding sites, their binding orientation, and in specific cases, their registry. Although it is not entirely novel and does not massively outperform competitors, the authors describe an additional approach to tackling the important question of DnaK-substrate interactions, and yield additional insight into binding orientation and registry.
Response: Thank you kindly for your constructive and thoughtful remarks.

Major remarks
Line 132: in order to determine how conserved the SBD conformation is, the authors align the structures from Table S1 and determine the RMSD to be small. However, they mention that they only look at structures with substrates in the forward orientation AND structures in which the substate does not contain Pro. Does this mean they have omitted 13 structures out of the 18 forward-oriented structures? In that case, this analysis hardly shows the SBD structure to be highly conserved. Also, does this mean the authors find the SBD needs substantial conformational changes in order to accommodate Pro? This would be interesting information and the authors should comment on this.

Response:
The backbone RMSD values of SBD conformations are provided in Figure S1A for all complexes (with and without prolines and regardless of orientation), which are very small (< 2 Å). That is, no substantial conformational change is required for SBD to accommodate different peptide substrates, including proline-containing ones. We note that the presence of proline residues does lead to substantial local movement of the substrate backbone (~1-2 Å). We have revised the manuscript to make this clearer (Page 7, lines 141-144).
Line 146: The authors mention that the SBD conformation is highly conserved between forwardand reverse-oriented substrates, though they never test for this since they don't use reverseoriented substrates in their RMSD analysis. It would be useful to include such an analysis to show that not only the substrate sidechains are located in roughly the same positions, but also the SBD remains in largely the same conformation.

Response:
We have updated Figure S1A to include the backbone RMSD values of SBDs of all crystal structures, regardless of the substrate orientation. There is no correlation between orientation of the substrate and binding configuration of the SBD. We have also emphasized the more important observation (for our recycling of forward-orientation energy terms) that the site 0 sidechain projects into similar sites regardless of orientation.
In their construction of a physics-based model, the authors consider 6 physical forces, but don't mention hydrogen bond formation at all. Could the authors comment on why they omit a crucial force in protein-protein interaction from their model, and why they feel this choice is justified?
Response: We fully agree that hydrogen bonding is a crucial kind of interaction. Modern generalpurpose empirical protein force fields (including CHARMM22, used in this work) don't explicitly include a hydrogen bonding term. Instead, hydrogen bonding is implicitly described using electrostatic and van de Waals interactions. This is in contrast to hybrid empirical/statistical energy functions such as FoldX or Rosetta, which often do contain such terms explicitly. Therefore, there is no omission of critical hydrogen bonding effects in this work.

Line 242-249: How do the authors reconcile the fact that they see only a minor preference for Lysine and no preference for Arg whatsoever, although preferences for both basic references have been described in literature, including by Rudiger et al, and that Arg is represented in many DnaKsubstrate interaction structures?
Response: There are two main sources for this apparent limitation of Paladin, namely the RDIE electrostatic model and SASA-based solvation term. In particular, the current model uses the solvation free energy of each complete residue, whereas residues such as Arg and Lys might more accurately be treated as a hydrophobic sidechain and polar head. We did not attempt to treat these sidechains in this fashion in the final model, mainly due to lack of experimental data on such decomposition and concern of overfitting the peptide array data. For more discussion about the limitation of the RDIE model, please see Response to Reviewer #3, comment 2a. We have more strongly emphasized these limitations where appropriate in the manuscript (e.g., page 17-18, lines 556-580 and page 24, lines 795-797).
Figure 2 caption: "hydrogen bonds are shown in magenta with matching size to their corresponding substrate". This is a little unclear, I assume the sizes of the hydrogen bond lines are thick/thin depending on whether they are formed with forward/reverse orientation. The differences are hard to make out in the figure

Also, the authors state this is a problem, but don't offer a solution for it. Would this offer an explanation as to why Arg is not favored in their algorithm, yet has been found in binders in vitro, as well as in DnaK-substrate structures?
Response: Please see Response to Reviewer #2, critique "Line 242-249 …" above. Figure S7C: there seems to be a yellow square in the top right hand corner of each of the plots, which it seems is not supposed to be there?
to pinpoint the binding registry at the residue level. We have updated the language in our manuscript to more accurately reflect that the information in the peptide array does contain some information about register but with ambiguity (Page 14, lines 410; page 22, lines 710-711).
We have included two new supplemental figures which address Paladin's ability to reproduce the peptide arrays. 1) Figure S8 contains a) the discrete peptide array data, b) the Rudiger model's predictions for comparison, and c) Paladin's predictions. The scores have been normalized to maximize readability. 2) Figure S9, which contains Precision-Recall curves, which more faithfully account for classification accuracy and error in this particular dataset (see Reviewer #2 comment and response starting with "-The AUC ROC values for Paladin," for more information). The new PR curve and example peptide array predictions provide a similar story that the Rudiger model, as one would expect, fits the peptide array data it was derived from best, and that Paladin does reproduce the peptide array data reasonably well, but like the other predictors is noisy. The ROC and PR metrics provide a succinct and quantitative way to summarize this fitting.

Some parameters in the prediction matrix do not fit previous experimental data.
a. The negative effect of acidic residues in the 5-residue binding site only has a moderate effect, while positive charges are dramatically disfavoured. This is contrast to both the statistical composition of the binding peptides and the finding that negatively charged residues wipe out binding, while positively charged residues do not. This suggests that the MD analysis of the present study does not adequately takes charges into account.

Response:
We thank the reviewer for the insightful comment. We agree that the current Paladin model has apparent limitations in recapitulating the binding preference of charged residues, despite the apparent overall success. This may be attributed to two key approximations, namely RDIE electrostatics and SASA solvation model. Please refer to our response above to Reviewer #2, critique "Line 242-249: … " (pages 8-9 of this response letter) for additional details. In addition, we note that these approximations are adopted not only for computational efficiency, but also to allow full pair-wise decompositions of all energy terms required for deriving a PSSM. For the more accurate generalized Born (GB) model, the effective Born radii depend on the conformation of the whole complex and the interaction energies can not be rigorously decomposed into a sum of side chain/site interaction contributions. We have added this discussion to the Methods section. (Page 23, lines 757-762) b. The structural basis of the predictor is based on the Hendrickson structure of the DnaK substrate binding domain. This study describes that the central position 0 is tailored for Leucine. This is consistent with the analysis of the composition of the residues in this region, which shows Leu strongly favoured over all other residues. This is not the case for the predictor here, and it is unclear why. Here Ile is almost as good as Leu.

Response:
We agree with the reviewer that it is somewhat puzzling that our analysis suggests that Leu and Ile would be comparably compatible with site 0. We comment on this in the text with respect to Paladin's inability to identify the flip of orientation between NRLLLTG (forward) and NRLILTG (reverse), which keeps L at site 0 in both cases (Page 21, lines 671-674). In addition, we note that the complementary experimental work in the Gierasch lab and previous experimental