Structural Interface Parameters Are Discriminatory in Recognising Near-Native Poses of Protein-Protein Interactions

Interactions at the molecular level in the cellular environment play a very crucial role in maintaining the physiological functioning of the cell. These molecular interactions exist at varied levels viz. protein-protein interactions, protein-nucleic acid interactions or protein-small molecules interactions. Presently in the field, these interactions and their mechanisms mark intensively studied areas. Molecular interactions can also be studied computationally using the approach named as Molecular Docking. Molecular docking employs search algorithms to predict the possible conformations for interacting partners and then calculates interaction energies. However, docking proposes number of solutions as different docked poses and hence offers a serious challenge to identify the native (or near native) structures from the pool of these docked poses. Here, we propose a rigorous scoring scheme called DockScore which can be used to rank the docked poses and identify the best docked pose out of many as proposed by docking algorithm employed. The scoring identifies the optimal interactions between the two protein partners utilising various features of the putative interface like area, short contacts, conservation, spatial clustering and the presence of positively charged and hydrophobic residues. DockScore was first trained on a set of 30 protein-protein complexes to determine the weights for different parameters. Subsequently, we tested the scoring scheme on 30 different protein-protein complexes and native or near-native structure were assigned the top rank from a pool of docked poses in 26 of the tested cases. We tested the ability of DockScore to discriminate likely dimer interactions that differ substantially within a homologous family and also demonstrate that DOCKSCORE can distinguish correct pose for all 10 recent CAPRI targets.


Introduction
Protein-protein interactions which constitute the ''interactome'' of the cell are found in the majority of cellular processes and are known to regulate responses of organisms to the varied environments. Protein molecules in isolation cannot perform any function in the cell; it is the ensemble and their organisation into complexes which maintain the cellular integrity [1]. There are excellent biochemical/experimental techniques, like yeast two hybrid and co-immunoprecipitation, to identify the interacting pairs of proteins. There have been many experimental and computational efforts to use the three-dimensional structures of proteins and identify the interacting pairs of proteins. Molecular docking is a computational technique which is used to predict protein-protein, protein-ligand interactions. It involves the identification of multiple conformations for the interacting partners and then calculating energies for interactions like electrostatics, Van der Waals etc. Initially, the conformational space is searched to find the possible solutions and then these docked poses are scored in order to identify the biological meaningful poses. It is a daunting task to distinguish the biological meaningful poses from a pool of solutions as proposed by the docking algorithm. It would be interesting to computationally predict the interacting pairs of proteins and this will help to gain insights into the molecular interactions behind the cellular process it is involved in.
There are many docking algorithms available like DOCK [2], Autodock [3], ZDOCK [4], FRODOCK [5], GRAMM [6], GOLD [7] etc. As a result of molecular docking, different numbers of docked poses are proposed by the docking algorithms. There is a challenge of scoring these docked poses and hence ranking them to identify the native or near-native structure [8]. In the field, there have been previous attempts to rank the docked poses [9], [10].However, the maximum accuracy achieved is ,77% and it takes into account only tightness of the interface [9]. Also, conservation of residues has been considered for selecting the near-native conformations from the pool of docked poses [8]. A recent knowledge-based method, DockRank derives the information from homologous interacting proteins for predicting the interface and ranks the docked poses based on the overlap with the predicted interface [11].
When the two components forming the complex interact with each other there is a formation of protein-protein interface. These complexes can be divided into two types based on the nature of interacting partners, homodimers where the two interacting partners are identical and heterodimers where the interacting partners are different. Various properties of this interface formed upon interaction can be utilised to identify the native complex from the pool of proposed docked poses. Here we are describing a new objective scoring scheme named DockScore which takes into account several interface parameters. The weights for these parameters are optimised to improve the accuracy in identifying native or near-native pose.
DockScore aims to determine the optimal interactions by taking into account various parameters of the interface. These parameters include surface area, spatial clustering, conservation of residues, presence of short contacts, hydrophobic residues and positively charged residues at the interface. The scoring method was first trained on a set of 30 non-homologous protein-protein complexes (15 homodimers and 15 heterodimers) to optimise the weights for different parameters. Further, performance of Dock-Score was assessed on 30 different protein-protein complexes and the native structure (deposited in PDB) or near-native structure was assigned as the best ranked complex from the pool of docked poses.
We further identified some examples of the protein families where different modes of dimerizations are known and tested DockScore for its ability to distinguish these native dimerization modes. It was also tested on 10 of the CAPRI targets.

Materials and Methods
An objective scoring scheme, DockScore, was developed in order to find the optimal interactions between the two interacting proteins. The scoring scheme was first trained on a set of 30 protein-protein complexes (listed below) and then applied on another set of 30 protein-protein complexes to recognise the best docked pose.
Various features were taken into account and each is described below.
The interface residues were identified using inter-chain C b -C b distance cut-off of 7 Å .

Training and test dataset and generating docked poses
A dataset of 30 protein-protein complexes was selected from previous studies, which includes 15 homodimers [12] and 15 heterodimers [13]. The protein chains in this dataset do not share more than 25% identity. The two chains from the complexes were segregated into receptor (chain A) and ligand (chain B). The coordinates of receptor chain were altered using the SYBYL software package (Version 7.1) (Tripos Associates Inc.). Further, molecular docking for test dataset was performed using FRO-DOCK [5] and 99 different docked poses were obtained for all the 30 complexes. The native structure for each of the complex deposited in The Protein Databank (PDB) [14] was added to the pool of 99 docked poses to generate a set of 100 poses for each PDB ID. The overall workflow of DockScore is represented as a schematic in Figure 1.
The same protocol as above was followed to obtain a dataset of 30 protein-protein complexes (15 homodimers and 15 heterodimers) and their docked poses were used to test the performance of DockScore. The chains within and across the two datasets (test and training) do not share more than 25% sequence identity.  Conservation of residues. Conserved interface residues were identified using ConSurf [15] server. The interface residues (N) from both the chains of the complex with conservation grade of 7, 8 and 9 were considered as conserved interface residues (C i ).

Interface parameters used in scoring
Scoring based on conservation was performed by assigning high score to the presence of high extent of amino acid conservation at the interface (C i ) (Equation 3) and normalising it by the number of interface residues (N).
Inter-chain short contacts. Short contacts here imply that two atoms do not closer than the sum of their VanDer Waals' radii. This is to avoid steric clashes and stabilise the complex as a whole. Our program CoilCheck [16] was employed to obtain the Van der Waals interaction energy between the two chains of the protein complex. This energy was used as a measure of the short contacts present at the interface. Since short contacts are undesirable, scores were assigned to the docked pose according to Equation 4, such that pose with high and positive energy obtains low score.

Score shortcontacts~E nergy highest {Energy complex Energy highest {Energy lowest ð4Þ
Here Energy highest is the highest energy value in the pool of 100 poses, Energy lowest is the lowest energy value in the pool of 100 poses and Energy complex is the energy value of the pose being scored.
Spatial Clustering. It was employed as a measure of the compactness at the interface. It was calculated for the interface, by computing the pairwise distances between the interface residues between the two chains and the residues with a C b -C b distance cut-off of 14 Å were considered as spatially clustered residues. Coefficient of clustering (h) was calculated using Equation 5, where'd' is the number of residues within the cut-off distance and 'N' is the number of interface residues at both chains.
Hydrophobic residues. Protein-protein interfaces of permanently interacting partners are known to be largely hydrophobic in nature [17][18][19]. We ranked those docked poses with high numbers of hydrophobic residues with a high score -as shown in Equation 6, where H i is the number of hydrophobic (A,V, L, I, M, F, W and Y) interface residues and N is the number of interface residues.
Positively charged residues. In the special case of interaction between DNA-binding proteins or transcription factors, we penalise the presence of positively charged residues at the proteinprotein interface using Equation (7), where P i is the number of positively charged residues interface residues (K, H and R) and N is the number of interface residues.  Energy of the docked pose after minimisation. The docked complex was solvated in cubic water box with space water model and box edges 8 Å from molecule periphery. For large complexes, water box with box edges 10 Å from molecule periphery was used. Positive (Na + ) or negative (Cl 2 ) ions were added to achieve charge neutrality. Following this, docked poses were subjected to energy minimisation using software GRO-MACS 4 [20], employing OPLS force field with 5000 steps of steepest descent. The final energy values were converted to log scale and the scoring was performed using Equation (8)  Evaluation of the Score Using all the above mentioned parameters, final score was derived by assigning weights to these parameters. The evaluation was performed in two rounds: in the first round, structures after energy minimisation were subjected to DockScore and in the second round, structures without energy minimisation were used.
The accuracy of the DockScore was measured by enumerating the number of complexes for which the native pose was ranked as the best pose (Equation 9).

Accuracy~S uccess complexes Total complexes ð9Þ
The weights were assigned to all the seven parameters employing two approaches. Firstly, the ranking of the parameters was done by using Leave-one-out approach and then assigning weights using Rank sum method (Equation 10, where K is the total number of parameters and r i is the rank of i th parameter).  Secondly, each parameter was assessed by using only that parameter alone and calculating the number of correct predictions. The weights derived were further normalised by the total sum of weights.

Proteins with different dimerization modes
The performance of DockScore was further tested on proteins belonging to same family and possessing different dimerization modes: GAF proteins, IL-8 superfamily proteins, DJ-1 superfamily and H-NS histone like proteins superfamily. For the different modes of dimerization known, the structures were obtained from PDB and a similar methodology was adopted to perform docking and then the docked poses were subjected to DockScore to access if it was able to distinguish between the different dimerization modes.

Testing on earlier CAPRI targets
Ten of the CAPRI targets were selected to test the preformance of DockScore. These were selected to include enzyme-inhibitor (3), antigen-antibody (2) and other protein-protein complexes (5).

Evaluation of DockScore performance on training set
DockScore was first trained on a set of 30 protein-protein complexes (Tables 1 and 2) and weights were assigned to different parameters used in scoring. 99 docked poses were generated for each of the 30 complexes using FRODOCK docking tool. DockScore takes into account several parameters of the interface. Firstly, it identifies the interface residues and then scores for conservation, short contacts, spatial clustering and the nature of residues at the interface (see Materials and Methods).
When the energy minimised poses (obtained using GROMACS v 4.5) were used, and the weights were assigned using one parameter at a time (Table S1). However, the accuracy observed was only ,27%.
Subsequently, the poses were examined without being subjected to energy minimisation and the performance of scoring was assessed by enumerating the complexes where DockScore accurately assigned to rank to the native form. Assigning equal weights to all the parameters resulted in an accuracy of 77% (pvalue = 0.0002). Next, the scores were assigned by removing one parameter at a time (Table S2) and also by using only one parameter at a time ( Table 3). The weights, when assigned employing the latter approach, resulted in an accuracy of 96% (29 out of 30 protein-protein complexes could be identified well).

Performance of DockScore on test dataset
After training the scoring scheme on a set of 30 protein-protein complexes, we selected another set of 30 complexes (Table 4 and 5) to test its performance and to assess the weights for each parameter which were optimised using training dataset. The 30 protein-protein complexes in the test dataset were kept different from those in the training dataset (please see Methods; Figure S1 and S2 reflect this in form of principal component analysis : PCAplot).
DockScore was able to distinguish the native pose from a pool of docked poses in 26 of the complexes in the test dataset (87% accuracy, these 4 cases belong to homodimers set). In the cases, where non-native pose was ranked higher than the native, there was a small difference (0.02-0.1) in the scores as assigned by DockScore. The non-native top scoring poses were observed to be structurally similar to the native pose ( Figure 2). We further, calculated the fraction of overlap of the interface residues between the native and the top-ranked non-native pose. We observed that this fraction of overlap lies in the range of 0.32-0.98.
We further compared the performance of DockScore with two of the previously reposrted scoring methods namely dDFIRE [21] and FireDock [22]. As mentioned before, DockScore ranked native structure as top-most structure for 26 of the complexes (for rest four cases, refer Figure 2: top-ranking pose is structurally similar to the native complex structure), whereas dDFIRE and FireDock report this in 16 of the complexes (Figure 3).

Superfamilies with different dimerization modes
DockScore was further assessed for its performance on the four unrelated superfamilies whose members are known to possess different dimerization modes: GAF proteins, IL-8 superfamily proteins, DJ-1 superfamily and H-NS histone-like proteins superfamily.
GAF domains are present in cGMP regulated cycilc nucleotides phosphodiesterases, certain adenylyl cyclases and bacterial TF FhlA (formate hydrogen lyase transcription activator) [23]. We selected three structures of proteins possessing GAF domains and having different modes of dimerization (1YKD, 1MC0 and 1F5M). 1YKD is a 1.9 Å resolution crystal structure of cyaB2/ GAF A and GAF B antiparallel dimer from Anaebaena [24], 1MC0 is 2.9 Å PDE2/GAF A and GAF B parallel dimer from Mus musculus [25] and 1F5M is 1.9 Å crystal structure of YKG9 protein from S.cerevisiae [23]. DJ-1 superfamily of proteins is known to exhibit different dimerization modes [26], there are four distinct patches of interface known in this superfamily. We selected a protein dimer complex corresponding to each of the four distinct types of dimerization modes to exemplify four respective interface patches (1IZY, 2VRN, 1QVV and 1VHQ).
SCOP also documents different modes of dimerization in certain superfamilies, like H-NS histone like proteins superfamily and IL-8 superfamily. We selected proteins with different modes of dimerization known in these superfamilies (for IL-8 like superfamily: 2IL8, 1DOK and for H-NS superfamily: 1OV9, 1LR1).
Using one chain as receptor and another as ligand, 99 docked poses were generated for each of the different complexes, employing FRODOCK. These 99 docked poses and the native structure, were subjected to DockScore.
For two of the superfamilies, IL-8 like and H-NS histone like proteins superfamily, DockScore accurately scores the native pose as the top ranked pose. In the case of GAF domains, out of the three different dimerization modes tested, DockScore accurately scores native pose as top-ranked pose in two modes. In the DJ-1 superfamily, out of the four different modes tested, native poses in two modes were accurately ranked as the top pose using DockScore. These cases, where non-native pose was ranked higher than the native, were further analysed.
For the DJ-1 superfamily, the top-ranked pose and the native pose were very similar and they differ in score by only 0.1 (Figure 4a) and the observed top-ranked pose in one of the dimerization mode (1IZY) also closely resembles the native dimerization mode seen in 1VHQ (Figure 4a). In one of the high-scoring trials corresponding to the dimerization mode of GAF domains, 1MC0, the difference in the scores of the native pose and the top-ranked pose was 0.2 but the interface of the topranked pose and the native pose were observed to share significant overlap (Figure 4b). Comparison of DockScore performance with other methods. The ability of DockScore to rank native structure as the top-most pose, was compared with two of the other methods namely dDFIRE and FireDock. The native structure was assigned top-most rank in 26, 16 and 16 complexes with DockScore, dDFIRE and FireDock respectively. The all-atom RMSD for smaller chain was computed between the native structure and the top-ranking pose as identified using the three methods. doi:10.1371/journal.pone.0080255.g003

Proteins in bound and unbound forms
In order to examine the effect of conformational changes and large-scale rotations that might accompany during interactions with other proteins, we next chose five proteins that are available in the bound as well as protein-unbound form as compiled in ComSin database [27]. DOCKSCORE could successfully rank the crystal structure as the highest and identify the best docked pose with high rank for the five proteins when started from the bound-conformation of the protein (Table S3). However, this happened less easily and only for two cases, when starting from the unbound-conformation of the protein confirming that recognition of the correct docking pose can remain a bottle-neck where there are substantial structural changes during protein-protein interactions.

Testing on CAPRI targets
DockScore performance was further evaluated on 10 of the earlier CAPRI targets (Table 6). This was performed to understand the sensitivity of DOCKSCORE in scoring different docked poses in CAPRI targets. Different docked poses were obtained using FRODOCK (as explained earlier for the training and testing datasets) and the native pose was included as in PDB file.
We were able to assign the top-most rank to the native structure in 8 out of 10 targets. In remaining two cases (3R2X and 1TA3), where some docked pose obtained a higher score than native, were further analysed for the structural proximity of high-ranking docked pose to the native pose. These poses were observed to be well-superposed with the native pose ( Figure 5) and the difference in the scores of the native and the top-ranking non-native poses were 0.05 and 0.07 respectively. Therefore, we were able to  Table 6. List of 10 earlier CAPRI targets (with chain identifiers in parentheses) employed to evaluate DockScore performance. accurately distinguish the native (or near-native) pose, out of a pool of docked complexes in all 10 of the CAPRI targets (Table S4).

Discussion
Since accurate structure determination of the macromolecular complexes is highly challenging and biologically important, prediction of protein-protein interactions through molecular docking is highly appropriate. However, the implementation of molecular docking for studying the interactions between a pair of proteins poses a challenge to identify the best docked pose out of the pool of various poses suggested by the docking program. For identifying the best docked pose, we devised an objective scoring scheme, named DockScore, which takes into account several interface parameters and hence ranks the docked poses.
The interface residues are first identified for each pose and the docked poses are ranked based on the parameters like interface surface area, spatial clustering at the interface, presence of hydrophobic, positively charged residues, conservation of residues and undesirable short contacts at the interface. The contribution of the energy of each pose was also tested by subjecting the energy minimised poses for scoring to DockScore. Before testing the performance of the scoring scheme for selecting the best pose, it was first trained on 30 protein-protein complexes. The weights were assigned, using one parameter at a time, indicating the importance of each parameter in the final score and thereby accuracy could be tuned close to 96%. Subsequently, it was tested for its accuracy on a test dataset, which comprised of 30 other complexes.
There were several tests adopted to assess the performance of DockScore: 1. The scoring was considered accurate, when DockScore was able to rank the native structure (deposited in PDB) or nearnative structure as the best rank pose.
2. If the non-native poses were ranked higher than the native pose, we further analysed the difference in their scores assigned by DockScore and the fraction of overlapping interface residues. 3. DockScore was further tested on four superfamilies, where different dimerization modes are known, for its ability to distinguish the native pose from a pool of docked poses. 4. This scoring scheme when further applied on 10 of the earlier CAPRI targets was able to distinguish the native pose from the pool of 100 docked poses in all 10 targets.
In the present study, FRODOCK has been used to obtain close to 100 docked poses. However, the objective scoring scheme devised can be used in general to rank the docked poses suggested by any docking program and for a variety of protein-protein interactions. In the near future, DockScore would be tested on other protein-protein complexes and other docking programs can also be used to test the accuracy of DockScore.
These kinds of studies will aim to provide detailed insights into the interactions amongst proteins which are quite direct in nature. Also, the existence of interactions among protein factors regulating the direct interactions will provide an additional level of regulation on one hand, whereas it will also lead to the combinatorial diversity of regulatory complexes. With different combinations of these factors, regulation of diverse numbers of genes can be achieved. Therefore, studying the physical interactions amongst proteins will provide useful insights into unravelling the basis of this combinatorial diversity in eukaryotes. Figure S1 Principal component analysis of 15 homodimers in the training set (corresponding to Table 1) and 15 homodimers in the test set (corresponding to Table 4). High dispersion shows that there is no bias or high sequence identity across the training and test dataset. (TIF) Figure S2 Principal component analysis of 15 heterodimers in the training set (corresponding to Table 2) and 15 heterodimers in the test set (corresponding to Table 5). High dispersion shows that there is no bias or high sequence identity across the training and test dataset. (TIF)