A Unified Conformational Selection and Induced Fit Approach to Protein-Peptide Docking

Protein-peptide interactions are vital for the cell. They mediate, inhibit or serve as structural components in nearly 40% of all macromolecular interactions, and are often associated with diseases, making them interesting leads for protein drug design. In recent years, large-scale technologies have enabled exhaustive studies on the peptide recognition preferences for a number of peptide-binding domain families. Yet, the paucity of data regarding their molecular binding mechanisms together with their inherent flexibility makes the structural prediction of protein-peptide interactions very challenging. This leaves flexible docking as one of the few amenable computational techniques to model these complexes. We present here an ensemble, flexible protein-peptide docking protocol that combines conformational selection and induced fit mechanisms. Starting from an ensemble of three peptide conformations (extended, α-helix, polyproline-II), flexible docking with HADDOCK generates 79.4% of high quality models for bound/unbound and 69.4% for unbound/unbound docking when tested against the largest protein-peptide complexes benchmark dataset available to date. Conformational selection at the rigid-body docking stage successfully recovers the most relevant conformation for a given protein-peptide complex and the subsequent flexible refinement further improves the interface by up to 4.5 Å interface RMSD. Cluster-based scoring of the models results in a selection of near-native solutions in the top three for ∼75% of the successfully predicted cases. This unified conformational selection and induced fit approach to protein-peptide docking should open the route to the modeling of challenging systems such as disorder-order transitions taking place upon binding, significantly expanding the applicability limit of biomolecular interaction modeling by docking.


Introduction
Among the wealth of protein-protein interactions that decide the cell's fate, peptide-mediated interactions play a crucial role and account for about 40% of them [1]. From co-activators to inhibitors, peptides are involved in many signaling and regulation pathways and have been identified to interact with a large number of protein domains. MHC, SH3 and PDZ domains are for instance well known for their affinity toward peptide binding [2][3][4]. This large diversity of functions and the importance of the many biological pathways they mediate make them prone to be associated with diseases [5]. An emergent field in drug design focuses on the development of peptides for therapeutic applications [6]. Peptides have advantages over small-molecule inhibitors in that they can mimic protein-binding domains and are large enough to competitively inhibit protein-protein interactions. Pharmaceutical leads include for example antimicrobial peptides [7,8], cyclic peptides [9] and also beta-breaking peptides that can inhibit amyloid fibril formation [10][11][12]. Another promising application field is that of fusogenic peptides used as cargo to deliver drugs to target cells [13].
Despite the large amount of data gathered on protein-peptide interactions [14,15], structural determination of their complexes remains challenging due to two major obstacles: peptides are highly flexible and they often interact weakly with their substrate, underlining their importance in signal transduction or regulation, which often relies on transient processes. These obstacles often make experimental structure determination nontrivial and call for complementary computational approaches like biomolecular docking. From a modeling perspective, conventional algorithms implemented either for protein-ligand or protein-protein docking also often struggle with the problem of flexibility [16].
Few methods have been published to date to model peptides onto their protein receptors. Initial applications focused on specific protein families or domains involved in peptide recognition [2,17-20], or were restricted to very short peptides [21]. Molecular dynamics simulations have also been used to predict protein-peptide interactions [22] but, even if providing interesting insights about the association process, they were only benchmarked against small sets of complexes and their applicability for the systematic screening of protein-peptide interactions remains to be demonstrated. FlexPepDock [23] was the first generic algorithm aiming at modeling near-native protein-peptide complexes, starting either from an ensemble of perturbed peptide structures or, with significantly less successful results, from an extended backbone conformation. FlexPepDock, which is also available as a web server, assumes knowledge of the binding site (anchor residues) to build and refine the peptide onto its receptor [24]. When no information about the peptide backbone conformation is available, the same authors have proposed a much more computationally demanding pipeline [25] that combines Rosetta ab-initio predictions to 'fold' the peptide and FlexPepDock to refine the binding mode. Our own information-driven flexible docking approach HADDOCK [26,27] has also been used in the past to model protein-peptide interactions, e.g. [28][29][30][31][32]. In HADDOCK, the docking is driven by (experimental) knowledge in the form of information about the interface region between the molecular components and/or their relative orientations, with applicability in protein-ligand, protein-nucleic acid and protein-protein docking predictions. HADDOCK can further handle simultaneously up to six molecules of various natures [33]. This integrative approach has proven its success in the blind international experiment CAPRI (Critical Assessment of PRedicted Interactions) [34] where it stands as one of the top-performing methods [35]. HADDOCK is also available as a web server [36], facilitating its usage by a large community. Despite its successful application to the prediction of various protein-peptide complexes, HADDOCK's performance for the modeling of protein-peptide interactions had not yet been systematically benchmarked nor had its protocol been thoroughly optimized for this purpose.
Development of a successful method for the modeling of protein-peptide interactions should also consider aspects of their molecular recognition mechanism. Over the last century several theories have been put forward to explain the molecular recognition process [37][38][39][40][41], among which induced fit [37,38] and conformational selection [39][40][41]. Conformational selection postulates that, already in the absence of its ligand, a protein exists in a number of discrete conformational states in equilibrium, including the one that preferentially binds the ligand. This concept is opposed to the induced fit theory, primarily introduced to describe enzyme action, which states that the conformational fit is induced by substrate binding. In recent years, a shift toward a reconciliation of both models can be observed, in which conformational selection and induced fit may in fact co-exist [42][43][44][45].
Here we present an optimized HADDOCK protocol for flexible protein-peptide docking that combines conformational selection and induced fit recognition mechanisms. The performance of this approach is demonstrated on a non-redundant protein-peptide benchmark, the PeptiDB dataset [46]. The latter consists of both naturally occurring peptides and short segments of proteins, mainly loops or disordered regions that fold upon binding. It was originally developed to test the FlexPepDock algorithm [23]. We demonstrate that, using a coarse definition of the interacting surface on the unbound protein receptor and no information on the peptide side, HADDOCK is able to generate near-native or sub-angstrom models for ~70% of the dataset in unbound/unbound docking.

Results
We have developed an efficient protein-peptide docking protocol that combines conformational selection with induced fit, capitalizing on two of HADDOCK's features: i) its ability to use ensembles of structures as starting points for the docking, and ii) its flexible refinement capabilities allowing for both backbone and side-chain flexibility. This protocol was optimized making use of the PeptiDB protein-peptide benchmark dataset [46] consisting of 103 non-redundant complexes, 62 of which also have the unbound form of the protein available. In the protein-protein docking field, the difficulty of a target is usually assessed by measuring the deviation between the unbound and bound forms of its constituents [34]. However, this measure is inapplicable within the context of protein-peptide docking since we usually do not have access to the free form of the peptide. Therefore, in order to define the difficulty of each target, we measured instead the deviation in terms of backbone RMSD of the bound form of the peptide from an ideal extended conformation (see Methods). Three different classes were defined (easy/medium/difficult) (Figure 1A). The conformational changes on the protein side are, in first instance, not taken into account in this classification as most peptides were observed not to induce any significant conformational changes on their partner upon binding [46].
Since HADDOCK is an information-driven docking approach, in order to drive the docking, we defined a large binding site on the protein receptor derived from the interacting residues within 5 Å from the peptide in the crystal structures of the complexes (see Methods). On average, this maps a surface of 1200 Å² on the protein side, whereas the average interface area in protein-peptide complexes is only about 500 Å² (see Table A in File S1). This rather broad definition of the binding region was chosen to drive the docking to an approximate location of the peptide binding site without introducing too much bias in our results by defining the tight pocket or groove that accommodates the peptide. While this of course represents a "best case" scenario where the binding regions are rather well defined, it allows us to concentrate on the problem of properly addressing the peptide flexibility in the docking process.

Bound/Unbound (Extended) Docking - Impact and Limitations of Flexible Refinement
We first evaluated the performance of HADDOCK in docking and refining extended peptides. Based on the solvent accessible surface areas of the peptide in the crystal structures of the complexes (Figure 1B), in half of the cases the peptide targets a hollow surface on the protein and in the other half the peptide remains largely exposed to the solvent. Especially for the hollow binding sites, flexible refinement is crucial in generating proper poses. Rigid-body docking only (it0 stage of HADDOCK) shows a success rate of 54% of acceptable (sub-angstrom or near-native) solutions (Figure 2A), while a subsequent refinement with both side-chain and backbone flexibility enables an induced fit of the peptides in their binding pocket, resulting in an increase of 18 percentage points to 72% of acceptable solutions over the benchmark. However, while peptides that bind onto their protein receptor in a stretched conformation (typically SH3 domains) can be successfully predicted, about 60% of the peptides binding in a helical conformation fail, which represents 44% of the failed cases (12 out of 28). Clearly, folding of a helix upon binding is far beyond the possibilities of the flexible refinement in HADDOCK. Other failed cases consist of deeply buried peptides and peptides with more complex and localized conformational changes including β-hairpins and turn-like conformations. The impact of the flexible refinement in terms of fraction of native contacts (Fnat) and RMSD improvements is discussed later.

Bound/Unbound Docking - Introducing Conformational Selection
Given the failure to model helical peptide conformations starting from extended stretches, we revised our approach to allow for conformational selection at the rigid-body docking stage of HADDOCK. Among 103 complexes, 21 peptides bind onto their receptor in a helical conformation, 42 as extended or beta and the remaining 38 as disordered. This proportion reflects what other studies reported about the conformations adopted by peptides upon binding [1]. Building on the ensemble docking capability of HADDOCK, we started the docking from an ensemble of three distinct conformations of the peptide: α-helix, polyproline II and extended. Together, these three conformations cover about 80% of the observed peptide bound structures in the Protein Data Bank [47]. This combined conformational selection and induced fit protocol led to an increased success rate of 76.3% among the final models after final water refinement (Figure 2B). Not only does the performance improve, but also the quality of the generated models, with 23 sub-angstrom high quality solutions compared to 17 for the induced fit approach only (Figure 2B).
In the cases where a helical conformation was selected, we however observed for some targets a distortion of the helical conformation after flexible refinement. To correct this, we introduced "on-the-fly" backbone dihedral angle restraints for helical regions, allowing variations of ±10° around the measured dihedral angles. These restraints corrected the loss of secondary structure, improving substantially the overall success rate to 79.4% (Figure 2C). More specifically, for the helical cases, the success rate increased from 50% without restraints to 65% with the new dihedral angle restraints. This is also reflected in an overall improvement of the interface-RMSD (i-RMSD) by ~1 Å and a ~10% increase in the total number of acceptable models.
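The ±10° window around a measured backbone dihedral can be illustrated with a small check that respects the 360° periodicity of angles. This is a minimal sketch for illustration only: the function names are hypothetical and it does not reproduce how HADDOCK defines or applies its restraints internally.

```python
def angular_deviation(angle_deg: float, reference_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    diff = (angle_deg - reference_deg) % 360.0
    return min(diff, 360.0 - diff)


def within_restraint(angle_deg: float, reference_deg: float,
                     tolerance_deg: float = 10.0) -> bool:
    """True if the angle lies within +/- tolerance of the reference dihedral.

    The 10 degree default mirrors the window quoted in the text; everything
    else here is an illustrative assumption.
    """
    return angular_deviation(angle_deg, reference_deg) <= tolerance_deg
```

For example, a phi angle of −50° satisfies a restraint centered on −57°, whereas −70° does not.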

How Successful is Conformational Selection?
Despite the demonstrated performance improvement, how successful is conformational selection in recovering the proper peptide conformation for a given complex? During rigid-body docking, 6000 models are written to disk. Each model is effectively the result of 10 minimization trials (five, each with automatic sampling of the 180° rotated solution), starting from one of the three conformations. Each conformation is thus represented equally in the total pool of models and it is up to the scoring to select the relevant models, since only the top 400 are further refined.
The HADDOCK score at the rigid-body stage is a combination of restraint, van der Waals, electrostatic and desolvation energies, together with a buried surface area term (HADDOCK rigid-body score = 0.01 E_rest + 0.01 E_vdw + 1.0 E_elec + 1.0 E_desol - 0.01 BSA) [26,27]. We assessed the pertinence of our scoring scheme for selecting the best conformation after it0 with respect to the peptide conformation in the complex. For the 19 helical cases, HADDOCK selected a majority of models coming from a starting helical conformation (Figure 3A), with 60% of the top 400 ranked models being in a helical conformation (corresponding to an enrichment factor of 1.80). The HADDOCK score also performs well for extended peptides since a majority of the selected models (~50%) are generated from an ideal extended conformation of the peptide (Figure 3B), with 46% of the top 400 ranked models being in an extended conformation (corresponding to an enrichment factor of 1.32). Unsurprisingly, for cases with a disordered conformation of the peptide in the crystal structure, all three conformations are homogeneously selected from our input ensemble, with a slightly smaller contribution of polyproline II conformations (Figure 3C). Note that we also investigated whether our scoring function at the rigid-body stage could be optimized to improve the selection performance by trying to maximize the number of acceptable models in the top 400 as described by [48]. Since no significant improvement could be found, the weights were kept at their default values, which also correspond to the default settings for protein-protein and protein-DNA docking. This has the advantage that the same scoring function can be used for various molecule types or mixtures thereof.
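The weighted sum quoted above can be written out as a short function. The sketch below is only an illustration: the weights are those given in the text, while the function names, the argument layout, and the definition of the enrichment factor (fraction selected divided by fraction in the starting pool) are assumptions made here.

```python
# Weights of the rigid-body (it0) score as quoted in the text.
IT0_WEIGHTS = {"rest": 0.01, "vdw": 0.01, "elec": 1.0, "desol": 1.0, "bsa": -0.01}


def rigid_body_score(e_rest, e_vdw, e_elec, e_desol, bsa, weights=IT0_WEIGHTS):
    """Weighted sum of energy terms and buried surface area (BSA)."""
    return (weights["rest"] * e_rest
            + weights["vdw"] * e_vdw
            + weights["elec"] * e_elec
            + weights["desol"] * e_desol
            + weights["bsa"] * bsa)


def enrichment_factor(fraction_selected, fraction_in_pool):
    """Assumed definition: share among selected models over share in the pool."""
    return fraction_selected / fraction_in_pool


# Hypothetical energy values, for illustration only.
score = rigid_body_score(e_rest=100.0, e_vdw=-20.0, e_elec=-5.0,
                         e_desol=2.0, bsa=1000.0)
# Three starting conformations, each one third of the pool; 60% helical
# models among those selected.
helical_enrichment = enrichment_factor(0.60, 1.0 / 3.0)
```

Under this assumed definition, a 60% share drawn from a pool in which each conformation makes up one third corresponds to the enrichment factor of 1.80 reported for the helical cases.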

Unbound/Unbound Docking - A Challenging Task
Having established an efficient protein-peptide docking protocol using the bound form of the receptor (bound/unbound docking), we put it to the test on the real case of unbound docking, i.e. starting from the unbound form of the receptor protein and three conformations of the peptide.The original PeptiDB benchmark contains 47 cases with available unbound structures.Further analysis allowed us to identify 15 additional unbound structures, resulting in a total of 62 unbound/unbound cases (12 of which with helical peptide conformations).
To assess the difficulty of the docking in the presence of the unbound receptor, we measured, next to the conformational changes of the peptide itself (see above), the conformational changes occurring between the bound and the unbound forms of each protein at the interface. Eight out of the 62 unbound structures undergo conformational changes upon binding larger than 2.0 Å (Table B in File S1), with a maximum of 11.5 Å. Such conformational changes might be large enough to theoretically increase i-RMSD values above our acceptable limit even when the peptide is perfectly modeled onto the protein.
Applying our conformational selection/induced fit docking protocol led to an overall unbound/unbound docking success rate of 69.4%, meaning that 43 out of 62 cases presented acceptable models in the final stage of HADDOCK (Figure 4). Among the 19 cases that failed, eight complexes correspond to cases where the protein undergoes conformational changes above 2.0 Å and four to cases with conformational changes between 1.5 and 2.0 Å. We can also assess the quality of the modeling for the peptide independently from changes on the protein, by calculating the RMSD on the interface residues of the peptide after fitting on the interface of the protein receptor (ligand interface RMSD; l-i-RMSD). Using this measure, we got acceptable models for 65% of the cases when considering a 2.0 Å threshold for near-native solutions. This increases to 83% for a 2.

Determinants for Success: Ranking and Clustering
HADDOCK's performance is very promising, with acceptable models in the top 400 for ~70% of the cases in unbound/unbound docking. How well do those models score, however? To evaluate this we assessed the ranking performance of HADDOCK as a function of the top-scored N models (N ranging from 1 to 400) (Figure 5). This analysis reveals that we only reach a 50% success rate over all cases when considering the top 20 models. This underlines the difficulty of consistently scoring protein-peptide predictions, as reported by a previous study [49]. On the other hand, the success rate only improves by ~10% when going from top 50 to top 400, indicating that our scoring function can still discriminate the "true negatives" (inaccurate models that have a lower score) reasonably well.
Ranking of individual models is however not the standard scoring procedure in HADDOCK: scoring is usually performed after clustering of the solutions and the final scoring is calculated on a per-cluster basis as the average score of the top 4 ranking models of each cluster. This has the advantage of smoothing the rather noisy contribution of individual energy terms, in particular the electrostatic energy. Cluster-based ranking successfully ranks a near-native cluster at the top in ~50% of the successful cases (cases for which HADDOCK generated at least one acceptable model in the top 400), and this quickly reaches 75% if the top three clusters are considered (Figure 6). For comparison, single structure scoring only ranks an acceptable model at the top in 21% of the cases (44% if we consider the top 3 ranking models) for which at least one acceptable model was generated. A few examples of unbound/unbound docking models obtained for challenging cases are illustrated in Figure 7.
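The top-N analysis can be sketched as follows: a case counts as a success at rank N if at least one of its N best-scored models is acceptable. The data layout and the function name are hypothetical, and the toy input is invented for illustration; it is not taken from the benchmark.

```python
def success_rate_top_n(cases, n):
    """Fraction of cases with at least one acceptable model in the top n.

    `cases` maps a case identifier to a list of booleans, ordered from the
    best-ranked to the worst-ranked model, where True marks an acceptable model.
    """
    hits = sum(1 for flags in cases.values() if any(flags[:n]))
    return hits / len(cases)


# Toy data (not from the paper): three cases with five ranked models each.
toy_cases = {
    "case_a": [True, False, False, False, False],
    "case_b": [False, False, True, False, False],
    "case_c": [False, False, False, False, False],
}
rates = {n: success_rate_top_n(toy_cases, n) for n in (1, 3, 5)}
```

Plotting such rates against N gives a curve of the kind described for Figure 5.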
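The per-cluster scoring described above, the average score of the top 4 models of each cluster, can be sketched in a few lines. The names and the toy scores are hypothetical, and the sketch assumes that lower (more negative) scores are better, as for an energy-like function.

```python
def cluster_score(model_scores, top=4):
    """Average score of the `top` best models of a cluster (lower is better)."""
    best = sorted(model_scores)[:top]
    return sum(best) / len(best)


def rank_clusters(clusters, top=4):
    """Return cluster names ordered from best to worst average top score."""
    return sorted(clusters, key=lambda name: cluster_score(clusters[name], top))


# Toy scores (not from the paper). Cluster "b" holds the single best model,
# but cluster "a" is more consistently well scored.
toy_clusters = {
    "a": [-100.0, -90.0, -80.0, -70.0, -10.0],
    "b": [-120.0, -50.0, -40.0, -30.0],
}
ranking = rank_clusters(toy_clusters)
```

In this toy case the cluster containing the single best-scored model is ranked second, which shows how averaging dampens the effect of one outlying score.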

Discussion
Through this study, we developed a specific protocol for the flexible docking of short peptides (5-15 amino acids) onto proteins using HADDOCK. The protocol starts from an ensemble of three different conformations for the peptide, inspired by the conformational selection mechanism. This canonical ensemble does not aim at reproducing the free state conformations sampled by the peptide, but rather represents conformations often observed in protein-peptide complexes. Out of these, the favorable conformations selected at the rigid-body stage are then subjected to an enhanced fully flexible refinement onto the protein, following the concept of induced fit. This approach enables HADDOCK to generate near-native models for ~80% of the cases for bound/unbound docking and ~70% of the cases for unbound/unbound docking, using the largest benchmark assembled to date. Furthermore, the HADDOCK cluster-based scoring scheme is shown to be efficient in retrieving an acceptable structure among the top three clusters in 75% of the successfully predicted cases, both for bound/unbound and unbound/unbound benchmarks. This represents quite a solid performance, especially considering that our method reaches an overall success rate of 93% (Figure B in File S1) when applied to the bound/bound dataset, which represents the ideal case and thus defines the upper limit of achievable success rate. Only seven cases among 101 were out of reach for HADDOCK because the peptides were deeply buried in the protein in the crystal structure. A drop of only 13.6% in success rate is observed from bound/bound to bound/unbound when we apply our ensemble-based flexible docking approach, and this is 23.6% for unbound/unbound. The performance of HADDOCK is however less impressive when considering the proportion of cases with sub-angstrom resolution models: it drops from 75% for bound/bound to 25% for bound/unbound docking, reaching only 17% in the case of unbound/unbound docking. This is still acceptable considering the difficulty of the problem. Together, these results confirm the relevance of our unified conformational selection/induced fit approach to the prediction of the 3D structure of protein-peptide complexes.
To further analyze our performance, we distinguished three levels of difficulty (easy/medium/difficult) for the docking based on the deviation between the bound form of the peptide and an ideal extended conformation (Figure 1A). This classification gives a direct indication of the amplitude of the conformational change required to fit the peptide at the interface, starting from an extended conformation. Interestingly, HADDOCK is not only able to provide reliable models for the easiest category of cases, but also achieves high success rates for the medium and difficult categories. Indeed, for the bound/unbound benchmark acceptable models are obtained in 87%, 79% and 76% of the easy, medium and difficult cases, respectively (Figure 2C, right panel). The corresponding percentages for unbound/unbound docking are more surprising, with success rates of 92%, 57% and 71% for the easy, medium and difficult categories, respectively. This indicates that our protocol is not limited to the "easy" cases where the peptide binds in a stretched conformation (for example on an SH3 domain) but can deal with more challenging cases as well.
We analyzed the impact of the flexible refinement stages of HADDOCK on the quality of the generated acceptable models after water refinement, considering all unbound/unbound benchmark cases. The interface-RMSD improves by 0.7 Å on average during the semi-flexible refinement (it1), and by up to 5.0 Å for some cases (Figure 8A), while the change is only moderate (0.02 Å on average with a maximum of 0.42 Å) during the water refinement of HADDOCK (Figure 8B). The fraction of native contacts improves by 0.25 on average during it1, and by up to 0.79 in some cases (Figure 8C). Some substantial improvement is still observed during the water refinement for some models, with a maximum of 0.28 (Figure 8D). Altogether, this shows that flexibility of the system is mostly handled during the flexible refinement stage (it1) of HADDOCK, while water refinement has the most impact on the fraction of native contacts and on the energetics.
We finally analyzed the impact of the chosen metric and associated cutoff on the success rate, for both interface-RMSD (i-RMSD, Figure 9A) and ligand-RMSD (l-RMSD, Figure 9B), the latter calculated on the entire peptide backbone. Interestingly, for i-RMSD cutoffs below 1.8 Å we observe that success rates remain similar for both bound/unbound and unbound/unbound docking (Figure 9A). Increasing the acceptability threshold above 2 Å results in more significant differences (~10-15% in success rate) between bound/unbound and unbound/unbound benchmarks. The same analysis for l-RMSD reveals almost identical success rates for both bound/unbound and unbound/unbound benchmarks for thresholds below 5 Å (Figure 9B). The differences in success rates between bound/unbound and unbound/unbound docking based on various i-RMSD thresholds (which are not observed for l-RMSD thresholds) suggest that our performance for unbound/unbound docking is affected by the conformational changes of the protein when assessed with the CAPRI standard interface-RMSD definition. However, the results are much less influenced by the flexibility of the protein when only ligand or ligand-interface RMSD are considered. Ligand-RMSD based metrics alone therefore overestimate the performance of protein-peptide prediction algorithms. A proper assessment should not only measure the quality of the structural refinement of the peptides, but also address the flexibility of the protein receptor.

Comparison with FlexPepDock and Dynadock
We compared the performance of our protocols with that of FlexPepDock, the only method that has been applied so far on the same protein-peptide dataset, concentrating on the bound/unbound benchmark since FlexPepDock results over the unbound/unbound dataset are not available. Both assume knowledge of the binding site: while HADDOCK uses a broad definition of the surface of interaction on the protein receptor, FlexPepDock assumes the knowledge of an anchoring residue at the interface. The two methods differ significantly in their approach to protein-peptide modeling but both allow a highly flexible sampling of the peptide. FlexPepDock's performance, assessed on the RMSDs of the peptide backbone only, reaches an overall 52% success rate whereas HADDOCK was able to model acceptable structures for ~70% of the cases using the same metric (Figure 10). Noticeably, FlexPepDock was not able to correctly model any helical cases. When restricted to the non-helical subset of their dataset, FlexPepDock reaches a success rate of 66%, with 49% of the cases containing at least one acceptable structure in the top five solutions. We repeated the same analysis for our results and observed a similar success rate of ~73%, with 53% of the cases containing acceptable structures in the top five solutions. After clustering, HADDOCK reaches a comparable success rate of 52% when considering the top three clusters. We should however note that FlexPepDock does generate more sub-angstrom resolution models, which might be explained by the different information used to guide the modeling (ambiguous interface versus anchoring residue).
Finally, we compared our protocol to the Dynadock method [49] that uses molecular dynamics simulations to account for the flexibility of the protein receptor. This method was benchmarked over 15 complexes that overlapped with our dataset. Dynadock defines a broad interface on the protein side (using a 6.5 Å threshold based on the crystal structure) and the peptides are initially "randomly" placed at their binding site, yet their orientation along the binding groove is restrained during the initial sampling. Using a ligand-interface RMSD with a threshold of 2.1 Å, Dynadock gets a final acceptable solution ranked as first for 11/15 cases. The same analysis for HADDOCK models gives acceptable solutions for 13/15 cases, including sub-angstrom solutions for six cases.
Figure 8. Difference in interface-RMSD (i-RMSD) and fraction of native contacts (Fnat) between models from various stages of HADDOCK (it0/it1/water) for unbound/unbound docking using our 3 conformation/enhanced flexibility protocol. The distributions are calculated from all generated models of the unbound/unbound docking benchmark. A negative i-RMSD difference value reflects an improvement (move toward the bound form) while a positive value indicates a deterioration of this i-RMSD. For Fnat this is reversed: a positive difference indicates an improvement. The impact of flexible refinement in torsion angle space (differences between rigid-body docking and flexible refinement (it1-it0)) is shown in A) i-RMSD diff and C) Fnat diff, and the impact of final water refinement (differences between flexible and water refinement (water-it1)) is shown in B) i-RMSD diff and D) Fnat diff. doi:10.1371/journal.pone.0058769.g008

Perspectives
Predicting large conformational changes remains a challenge, as indicated by our failure to accurately predict cases where the protein undergoes large conformational changes upon binding. Among the eight cases with conformational changes at the receptor interface above 2.0 Å, two reveal a complete shift of one helix that 'wraps' the peptide, and the six others exhibit mostly local loop variations or secondary structure rearrangements. For the first two cases, a multi-body docking approach where the flexible domain of the protein would be considered as a separate body for the docking process could significantly improve the modeling. This was successfully applied in the past to protein-protein docking [50]. The problem is rather that such changes are difficult if not impossible to predict. The other cases could benefit from an initial refinement of the protein receptor alone, by molecular dynamics simulation for instance.
HADDOCK is a data-driven method that incorporates information during the docking process to narrow the search. In this work, we defined a broad binding site on the protein receptor directly from the crystal structures, which represents a best-case scenario. When no experimental information is available, one could rely on bioinformatics predictions or other computational methods to predict the interaction surfaces. Peptides indeed seem to recognize "hot spot" residues on the protein [51] that might be predictable. A number of approaches have been reported to predict peptide-binding sites on proteins [52,53]. Such predictors could be useful in the context of HADDOCK. Their performance for docking purposes will however have to be benchmarked in the future.

Protein-peptide Docking Benchmark
We used as benchmark the PeptiDB non-redundant dataset (sequence identity <70% for any two receptors) of 103 high-resolution (X-ray structures; <2 Å resolution) complexes of proteins bound to short peptides (5-15 amino acids long) [46]. This dataset also contains 47 high quality structures of unbound protein receptors that we complemented with 15 recently released unbound structures of the proteins, making together an unbound dataset of 62 cases (Table C in File S1).
Two complexes were removed from this dataset because of the total inaccessibility of the peptide binding site in the crystal structure (1XOC and 2D5W). For bound/unbound docking, four more cases were removed (1D4T, 1GYB, 2FMF and 2VJ0) because of problems with the coordinates of the proteins.

Benchmark Classification
We divided our benchmark into three classes (easy/medium/difficult) based on the backbone RMSD between the conformation of the peptide in the crystal structure and its ideal extended conformation.
• easy: RMSD bound/extended ≤ 4 Å
• medium: 4 Å < RMSD bound/extended ≤ 8 Å
• difficult: RMSD bound/extended > 8 Å
The secondary structure of the peptides was assigned with STRIDE [54]. STRIDE has difficulty assigning short amino-acid sequences that do not show consistent torsion angles for a particular conformation. We therefore considered as extended peptides those for which at least 80% of the sequence was in an ideal extended conformation. An ideal extended polyalanine shows an intramolecular distance between two consecutive Cα equal to 3.46 Å. If the average Cα-Cα distance is larger than 3.46 × 0.8 ≈ 2.8 Å, then the peptide was considered as extended. This can be expressed mathematically in the following manner:
$$\frac{\lVert P_1 P_n \rVert_{C\alpha}}{n - 1} > 3.46 \times 0.8 \approx 2.8\ \text{Å}$$
where P indicates the peptide, ‖P_1 P_n‖_Cα the end-to-end distance between the first and the last Cα, and n the length of the peptide. Peptides were classified as helices when STRIDE [54] assigned more than half of their sequence to a helical conformation. All peptides that did not fall into the helical or extended classes were considered as disordered.
This classification scheme resulted in 21 helices, 42 extended and 38 disordered peptides.
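The classification rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the per-residue average is taken here as the end-to-end Cα distance divided by the number of consecutive Cα pairs (an assumption), and the helical class is tested before the extended one, an order the text does not specify.

```python
import math

# Cutoff from the text: 80% of the ideal consecutive Cα-Cα distance (3.46 Å), rounded to 2.8 Å.
EXTENDED_CUTOFF = 2.8  # Å


def difficulty_class(rmsd_bound_extended):
    """Return easy/medium/difficult from the backbone RMSD (Å) to the ideal extended form."""
    if rmsd_bound_extended <= 4.0:
        return "easy"
    if rmsd_bound_extended <= 8.0:
        return "medium"
    return "difficult"


def is_extended(ca_coords):
    """True if the end-to-end Cα distance per consecutive Cα pair exceeds the cutoff.

    ca_coords: list of (x, y, z) tuples, one per residue, in sequence order.
    Dividing by (n - 1) is an assumption of this sketch.
    """
    n = len(ca_coords)
    if n < 2:
        raise ValueError("need at least two Cα atoms")
    end_to_end = math.dist(ca_coords[0], ca_coords[-1])
    return end_to_end / (n - 1) > EXTENDED_CUTOFF


def peptide_class(ca_coords, helical_fraction):
    """Helix if more than half of the residues are helical, else extended or disordered."""
    if helical_fraction > 0.5:
        return "helix"
    if is_extended(ca_coords):
        return "extended"
    return "disordered"
```

For example, a peptide whose Cα atoms lie on a straight line 3.46 Å apart is classified as extended, whereas the same number of atoms spaced 1.5 Å apart end to end falls into the disordered class unless STRIDE reports it as mostly helical.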
To evaluate the conformational change between the unbound and bound forms of a protein, we calculated the RMSD of the protein interface, defining as interacting residues those within 10 Å of the peptide in the crystal structure of the complex (Table B in File S1).

Solvent Accessibility Calculations
Solvent accessibility and buried surface areas were calculated using NACCESS [55]. We defined a residue as solvent accessible if its side chain or its backbone has a relative accessibility over 40%.
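The accessibility criterion above amounts to a simple per-residue test. The sketch below assumes that the relative side-chain and backbone accessibilities (in percent) have already been extracted from the NACCESS output; the function names and the input layout are hypothetical.

```python
RELATIVE_ACCESSIBILITY_CUTOFF = 40.0  # percent, as stated in the text


def is_solvent_accessible(side_chain_rel, backbone_rel,
                          cutoff=RELATIVE_ACCESSIBILITY_CUTOFF):
    """A residue is accessible if its side chain OR its backbone exceeds the cutoff."""
    return side_chain_rel > cutoff or backbone_rel > cutoff


def percent_accessible(residues):
    """Percentage of accessible residues.

    residues: list of (side_chain_rel, backbone_rel) tuples, one per residue.
    """
    if not residues:
        raise ValueError("empty residue list")
    accessible = sum(1 for sc, bb in residues if is_solvent_accessible(sc, bb))
    return 100.0 * accessible / len(residues)
```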

Modeling of Peptide Starting Conformations
PyMOL [56] was used to generate the three distinct conformations for every peptide. According to standard Ramachandran plots, the helical conformations were modeled with phi and psi angles of −57° and −47°, respectively. Polyproline II conformations were built with −78° for phi and 149° for psi. Finally, the extended conformations of the peptides were generated using −139° for phi and −135° for psi.
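The three sets of backbone angles can be collected in a small lookup table, from which a per-residue list of target dihedrals is generated for any peptide sequence. This sketch only tabulates the angles quoted above; it does not build coordinates, and the names used are hypothetical.

```python
# (phi, psi) in degrees for each starting conformation, as quoted in the text.
BACKBONE_ANGLES = {
    "alpha-helix": (-57.0, -47.0),
    "polyproline-II": (-78.0, 149.0),
    "extended": (-139.0, -135.0),
}


def backbone_dihedrals(sequence, conformation):
    """Return [(residue, phi, psi), ...] with the same angles applied to every residue.

    Note: phi of the first residue and psi of the last one are undefined in a
    real chain; a model builder would simply ignore them.
    """
    phi, psi = BACKBONE_ANGLES[conformation]
    return [(residue, phi, psi) for residue in sequence]


def starting_ensemble(sequence):
    """Return the target dihedrals of the three starting conformations of a peptide."""
    return {name: backbone_dihedrals(sequence, name) for name in BACKBONE_ANGLES}
```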

HADDOCK Settings
The bound conformations of the protein-peptide complexes were downloaded from the Protein Data Bank [57]. The interface of the protein was defined from the crystal structure as follows: active residues (the defined interface for docking) on the protein side were defined as those within 5 Å of the peptide chain. Peptide residues were treated as passive. Random removal of restraints was turned off. Within HADDOCK, ambiguous interaction restraints drive active residues to be part of the interface as much as possible, whereas passive residues can be part of the interface. HADDOCK will thus try to satisfy as many interactions involving active residues as possible.
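The distance-based definition of active residues can be illustrated with a short, dependency-free sketch. It assumes that atom coordinates have already been parsed from the crystal structure; the function name and the input layout are hypothetical, and a production script would typically use a structure-parsing library and a spatial index instead of the brute-force loop shown here.

```python
import math


def residues_within(protein_atoms, peptide_atoms, cutoff=5.0):
    """Return the sorted protein residue ids having any atom within `cutoff` Å of the peptide.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    peptide_atoms: iterable of (x, y, z) tuples, one per peptide atom.
    """
    peptide_atoms = list(peptide_atoms)
    selected = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in selected:
            continue  # residue already selected through another atom
        if any(math.dist(xyz, p) <= cutoff for p in peptide_atoms):
            selected.add(residue_id)
    return sorted(selected)
```

The same function with `cutoff=10.0` reproduces the wider residue selection used elsewhere in this section for interface RMSD calculations.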
A typical HADDOCK docking run involves three consecutive steps. First, the molecules are randomly oriented and a rigid-body energy minimization is performed (it0). The top-ranked models (here the top 400) are then passed on to the semi-flexible simulated annealing stage performed in torsion angle space (it1). In this study, this stage has been turned into a fully flexible simulated annealing as described below. Finally, the structures obtained after the simulated annealing are refined in an explicit solvent layer to further improve their scoring (water).
From preliminary tests on a small representative set (20 complexes), an increase in the number of flexible refinement steps by a factor of four was found to lead to better conformations. Accordingly, the number of MD steps for the four stages of the flexible refinement was increased from the default 500/500/1000/1000 to 2000/2000/4000/4000. These settings were subsequently applied to the entire benchmark.
The peptides were defined as fully flexible, meaning that side-chain and backbone flexibility is allowed from the start of the refinement stage (it1). On the protein side, only residues that are part of the interface (determined on the fly during docking) are treated as flexible, first allowing only side-chain flexibility, followed by both backbone and side-chain flexibility in the final simulated annealing stage of it1. Note that this protocol thus allows for flexibility in the protein, even when starting from the bound form. The RMSD clustering cutoff was decreased from 7.5 Å to 5.0 Å to account for the smaller size of protein-peptide interfaces. Finally, we specified charged C- and N-termini when there was an indication that the peptide occurs naturally, and uncharged termini when the peptide was a fragment of a protein or was capped in the crystal structure.
Bound/unbound (extended) docking runs. The number of models generated during the three main stages of HADDOCK (it0/it1/water) was increased to 2000/400/400.
Bound/unbound (3 conformations) docking runs. The number of models generated during the three main stages of HADDOCK (it0/it1/water) was increased to 6000/400/400. In that way, each conformation is sampled 2000 times in the rigid-body stage.
Unbound/unbound docking runs. The only change compared to the bound/unbound (3 conformations) docking protocol is that the structure of the protein receptor in its unbound form was used, as downloaded from the Protein Data Bank.
All HADDOCK runs were launched on the WeNMR grid version [58] of the HADDOCK server (http://haddock.science.uu.nl/enmr/services/HADDOCK/haddock.php) that makes use of the European Grid Infrastructure (EGI) computing resources.On average, each run took between five and six hours to complete on the grid.
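The sampling settings of the three protocols described in this section can be summarized as plain data, which makes the arithmetic behind them easy to check. The dictionary keys below are hypothetical labels chosen for this sketch and are not actual HADDOCK parameter names.

```python
DEFAULT_MD_STEPS = (500, 500, 1000, 1000)  # four stages of the flexible refinement


def scale_md_steps(steps, factor=4):
    """Multiply the number of MD steps of every refinement stage by `factor`."""
    return tuple(step * factor for step in steps)


# Hypothetical labels; values taken from the text.
PROTOCOLS = {
    "bound/unbound (extended)": {
        "receptor": "bound", "conformations": 1,
        "it0": 2000, "it1": 400, "water": 400,
    },
    "bound/unbound (3 conformations)": {
        "receptor": "bound", "conformations": 3,
        "it0": 6000, "it1": 400, "water": 400,
    },
    "unbound/unbound": {
        "receptor": "unbound", "conformations": 3,
        "it0": 6000, "it1": 400, "water": 400,
    },
}

CLUSTERING_CUTOFF = 5.0  # Å, reduced from the default 7.5 Å


def rigid_body_models_per_conformation(protocol):
    """Number of it0 models sampled for each peptide starting conformation."""
    return protocol["it0"] // protocol["conformations"]
```

With these values, `scale_md_steps(DEFAULT_MD_STEPS)` gives the 2000/2000/4000/4000 steps quoted above, and every protocol samples each starting conformation 2000 times at the rigid-body stage.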

Quality Assessment Criteria
In order to assess the quality of the models generated by HADDOCK, we used criteria as defined by the CAPRI experiment [34,59]. The cutoffs were, however, reduced compared to standard protein-protein docking to account for the small size of the peptides. The quality of docking models was assessed using the interface RMSD (i-RMSD) as follows:
- Not acceptable: i-RMSD > 2 Å
- Near-native prediction: 1 Å ≤ i-RMSD ≤ 2 Å
- High-quality (sub-angstrom) prediction: i-RMSD ≤ 1 Å
The i-RMSD is calculated on the backbone atoms of both protein and peptide residues that are within 10 Å of the partner molecule (as defined based on the crystal structure of the complex). The l-RMSD, when mentioned, is calculated on the backbone atoms of the peptide only, after fitting on the backbone atoms of the protein.
We further refer to any near-native or better (sub-angstrom) prediction as an "acceptable model".
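The quality classes and the notion of a successful docking run translate directly into code. The following sketch uses hypothetical function names, assumes that the i-RMSD values of the models are already available in ranked order, and assigns a model with an i-RMSD of exactly 1 Å to the high-quality class.

```python
def classify_model(i_rmsd):
    """Quality class of a docking model from its interface RMSD (Å)."""
    if i_rmsd <= 1.0:
        return "high-quality"
    if i_rmsd <= 2.0:
        return "near-native"
    return "not acceptable"


def is_acceptable(i_rmsd):
    """Acceptable models are near-native or better (i-RMSD of at most 2 Å)."""
    return i_rmsd <= 2.0


def success_in_top(ranked_i_rmsds, top_n):
    """True if at least one acceptable model is found among the top_n ranked models."""
    return any(is_acceptable(value) for value in ranked_i_rmsds[:top_n])


def success_rate(runs, top_n):
    """Percentage of docking runs with an acceptable model in their top_n models.

    runs: list of runs, each a list of i-RMSD values ordered by rank.
    """
    if not runs:
        raise ValueError("no docking runs given")
    successes = sum(1 for ranked in runs if success_in_top(ranked, top_n))
    return 100.0 * successes / len(runs)
```

Evaluating `success_rate` for increasing values of `top_n` yields the kind of curve used to report success as a function of the number of top models considered.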

Figure 1. Protein-peptide benchmark characteristics. (A) Distribution of positional backbone RMSDs between the bound form of the peptides present in the benchmark and an ideal extended conformation. These are classified into three categories (easy, medium and difficult) based on the amplitude of the conformational change upon binding. (B) Percentage of solvent accessible residues computed for all peptides in the crystal structures of the respective protein-peptide complexes. doi:10.1371/journal.pone.0058769.g001
5 Å threshold (Figure A in File S1).

Figure 2. Overall HADDOCK results for (A) bound/unbound (extended), (B) bound/unbound (3 conformations) and (C) bound/unbound (3 conformations) with enhanced flexibility. The percentages of near-native and sub-angstrom resolution models at the various stages (rigid-body (it0), semi-flexible (it1) and water refinement (water)) are reported in the left panels and were calculated over the 400 final models generated by HADDOCK. The right panels show the percentages after water refinement as a function of the docking difficulty. doi:10.1371/journal.pone.0058769.g002

Figure 4. Unbound/unbound docking performance using the conformational selection/induced fit HADDOCK protocol. The percentages of near-native and sub-angstrom resolution models (see Methods) at the various stages (rigid-body (it0), semi-flexible (it1) and water refinement (water)) are reported in the left panels and were calculated over the 400 final models generated by HADDOCK. The right panels show the percentages after water refinement as a function of the docking difficulty. doi:10.1371/journal.pone.0058769.g004

Figure 5. Success rate of unbound/unbound docking as a function of the number of top models considered. A docking run is defined as successful if at least one near-native model is present within the topXX selected models. doi:10.1371/journal.pone.0058769.g005

Figure 7. Examples of HADDOCK best models for the challenging unbound/unbound cases. The PDB-id as well as difficulty, peptide length, rank and i-RMSD values are indicated for each case. The model selected for illustration is the acceptable model with the best rank at the end of the HADDOCK process. The model peptide is shown in purple together with the reference peptide in the crystal structure of the complex in black. Docking model and crystal structure were superimposed on backbone atoms of the protein. The protein (crystal structure) is shown in surface representation. (A) 1NX1, (B) 1CZY, (C) 1LVM and (D) 1D4T. Figure generated with PyMOL [56]. doi:10.1371/journal.pone.0058769.g007

Figure 9. Impact of the (A) i-RMSD and (B) l-RMSD cutoffs defining a near-native solution on the docking performance. In this analysis, a docking run is defined as successful if at least one near-native model (for the selected cutoff) is generated within the pool of 400 water-refined models. Results are presented for both bound/unbound (97, black) and unbound/unbound (62, gray) cases. doi:10.1371/journal.pone.0058769.g009