Accuracy of Protein-Protein Binding Sites in High-Throughput Template-Based Modeling

The accuracy of protein structures, particularly their binding sites, is essential for the success of modeling protein complexes. Computationally inexpensive methodology is required for genome-wide modeling of such structures. For systematic evaluation of potential accuracy in high-throughput modeling of binding sites, a statistical analysis of target-template sequence alignments was performed for a representative set of protein complexes. For most of the complexes, alignments containing all residues of the interface were found. The full interface alignments were obtained even in the case of poor alignments where a relatively small part of the target sequence (as low as 40%) aligned to the template sequence, with a low overall alignment identity (<30%). Although such poor overall alignments might be considered inadequate for modeling of whole proteins, the alignment of the interfaces was strong enough for docking. In the set of homology models built on these alignments, one third of those ranked 1 by a simple sequence identity criterion had RMSD<5 Å, the accuracy suitable for low-resolution template-free docking. Such models corresponded to multi-domain target proteins, whereas for single-domain proteins the best models had 5 Å<RMSD<10 Å, the accuracy suitable for less sensitive structure-alignment methods. Overall, ∼50% of complexes with the interfaces modeled by high-throughput techniques had accuracy suitable for meaningful docking experiments. This percentage will grow with the increasing availability of co-crystallized protein-protein complexes.


Introduction
Protein interactions are a central component of life processes. The structural characterization of these interactions is essential for our ability to understand these processes and to utilize this knowledge in biology and medicine. Experimental approaches, primarily X-ray crystallography, are producing an increasing number of protein structures (www.pdb.org), which to a certain extent are representative of a significant part of the "protein universe." However, the overall number of proteins by far exceeds the capabilities of the experimental structure-determination approaches [1,2]. The answer to this discrepancy is computational modeling of protein structures. The modeling not only can supply the vast majority of protein structures, but also, importantly, is indispensable for understanding the fundamental principles of protein structure and function.
Computational structure prediction methodology historically started with ab initio approaches based on approximation of fundamental physical principles, and continues to develop in this direction for the goal of learning the principles of protein structure and function. However, for the purpose of predicting protein structures, it has largely evolved to comparative techniques based on experimentally determined structural templates (to a significant extent due to the increasing availability of such templates). Such approaches are faster, more reliable, and provide accuracy increasingly comparable with experimental approaches [3].
A similar trend is underway in structural modeling of protein interactions, i.e., protein docking [4,5]. Because of the nature of the problem, the ab initio structure-based methods in docking (prediction of the complex from known separate structures) are relatively more reliable than those in individual protein modeling (docking in the rigid-body approximation has only six degrees of freedom and has an established record of practical applications). However, the knowledge-based docking approaches, including the template-based ones, are rapidly developing, following the increasing availability of the experimentally determined structures of protein-protein complexes, which generally are more difficult to determine than the structures of individual proteins [6-8]. It was established by studies based on different sets of proteins that proteins similar in sequence, fold and/or function share similar binding sites [9-12].
Quantitative guidelines for the quality of homology modeling of protein complexes were provided by Aloy et al. [13], where it was demonstrated that sequence identities >40% yield high similarity of protein-protein binding sites.
The modeling techniques for proteins and protein complexes applicable to entire genomes have to be high-throughput by design. For this reason, and because the availability of templates is still limited, the modeling techniques combine high-resolution approaches, when available and computationally feasible, with low-resolution capabilities, for broad coverage of the proteome/interactome. Such low-resolution approaches are still capable of predicting essential structural characteristics of proteins and protein interactions, including the binding sites [14-16], macromolecular assemblies [17] and binding modes for protein-protein [18,19] and protein-ligand [20] complexes.
For template-based docking (based on co-crystallized protein-protein templates), the degree of similarity to the templates is key to the accuracy of the docking. For ab initio, as well as some knowledge/template-based docking techniques, the accuracy of the resulting structures depends directly on the accuracy of the individual participating proteins, which in turn depends on the similarity to the templates of individual proteins. In both cases, the critical component affecting the docking outcome is the ability to model the structures of the binding sites. Although one can argue that the structure of the whole proteins is important in general, the binding sites are the parts that have a direct effect on the accuracy of the predicted complex. Earlier estimates showed that the binding site accuracy of ∼6 Å Cα RMSD is sufficient for low-resolution ab initio docking [19] (∼3 Å Cα RMSD for small ligand-receptor docking [20]), with even lower accuracy suitable for meaningful docking prediction by template-based docking (Sinha et al., in preparation).
In the current study we present a systematic analysis of the sequence alignment and subsequent modeling accuracy of known protein-protein binding sites. The analysis is performed and validated on the DOCKGROUND comprehensive dataset of co-crystallized protein-protein complexes [21]. In keeping with the purpose of this study (the assessment of high-throughput modeling capabilities for genome-size systems), the modeling was deliberately performed in a high-throughput fashion using standard alignment (BLASTPGP [22]) and comparative modeling (NEST [23]) programs, as opposed to more detailed and sophisticated (but also more computationally expensive) multi-template procedures. The results show that for a significant part of the proteins the binding sites can be modeled with accuracy that would ensure meaningful docking, even in cases of alignments considered poor for modeling of monomeric proteins. Thus, structural modeling of protein-protein interactions can often be performed by means simpler than those typically used for modeling of monomeric proteins, despite the fact that protein-protein interactions in general are on the next complexity level relative to individual proteins. However, further advancement of large-scale, high-throughput docking requires progress in experimental determination of structural templates.

Interface Coverage in Local Alignments
To assess the potential quality of binding site modeling, the sequences of 658 two-chain complexes (Table 1) were subjected to a PSI-BLAST search for homologous sequences in the PDB. The following alignments were excluded from the resulting pool: (a) statistically insignificant alignments with expectation value e>1 and (b) alignments with target/template difference <10 residues. The latter allowed us to avoid a bias in alignment statistics caused by overrepresentation of certain groups of the proteins and their mutants in PDB. The resulting 66,706 alignments were further analyzed in terms of the target sequence coverage q (see Methods, Eq. 1), and coverage of the target interface residues q_int (Eq. 2), with an emphasis on alignments with q_int = 100% (hereafter referred to as full interface coverage, or FIC, alignments). A residue of the target complex was assigned to the interface if the distance between any atom of the residue and any atom of the other subunit in the complex was less than the sum of the van der Waals radii of the atoms plus the diameter of a water molecule (2.8 Å). An alignment was considered FIC with a level of tolerance that allowed one target interface residue to be missing in the alignment. The analysis showed that 37,062 alignments, or 56.1% of the entire alignment pool, are FIC alignments. In terms of target complexes, FIC alignments were observed for both monomers in alignments of 218 target complexes and for one of the monomers in an additional 101 targets, which together constitute most (97%) of the dataset.
In the distribution of FIC alignments for different functional classes of proteins (Table 2), notably, but not surprisingly, antibody-antigen complexes, which represent a small fraction (3.6%) of the protein set, produce a significant part of all alignments (17.5%, or ∼970 alignments per target complex), with FIC alignments for both monomers in all 12 cases. Interestingly, in two other functional classes (enzyme-inhibitor and cytokine receptor) the FIC alignments were observed for at least one monomer in almost 100% of cases as well, the only exception being 1e44, for which PSI-BLAST did not find any homologous sequences in PDB. Out of 11 cases in the "other" functional class for which no FIC alignments were found, 8 cases had no statistically significant alignments. In 3 complexes (1o6s, 1tt5, and 1zm2) the interface consisted of terminal residues only. Thus the interface coverage could have been significantly reduced by the absence of these terminal residues in an alignment, which is often the case in local alignments.
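The interface definition and the FIC test described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the input layout, and the van der Waals radii table (Bondi-type values) are assumptions, and q_int is assumed here to be the percentage of target interface residues present in the alignment, since Eq. 2 is given in Methods.

```python
from itertools import product

WATER_DIAMETER = 2.8  # Å, diameter of a water molecule as stated in the text

# Assumed van der Waals radii (Å) per element; the radii set used in the
# study is not specified in this section, so these values are illustrative.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def is_interface_residue(residue_atoms, partner_atoms):
    """Return True if any atom of the residue is closer to any atom of the
    other subunit than the sum of their vdW radii plus 2.8 Å.

    Both arguments are lists of (element, (x, y, z)) tuples.
    """
    for (el_a, xyz_a), (el_b, xyz_b) in product(residue_atoms, partner_atoms):
        cutoff = VDW_RADII[el_a] + VDW_RADII[el_b] + WATER_DIAMETER
        dist = sum((a - b) ** 2 for a, b in zip(xyz_a, xyz_b)) ** 0.5
        if dist < cutoff:
            return True
    return False


def interface_coverage(interface_residues, aligned_residues):
    """Assumed form of q_int: percent of interface residues in the alignment."""
    interface = set(interface_residues)
    if not interface:
        raise ValueError("target has no interface residues")
    return 100.0 * len(interface & set(aligned_residues)) / len(interface)


def is_fic(interface_residues, aligned_residues, tolerance=1):
    """FIC test with the tolerance from the text: at most one target
    interface residue may be missing from the alignment."""
    missing = set(interface_residues) - set(aligned_residues)
    return len(missing) <= tolerance
```

For example, `is_fic([1, 2, 3, 4], [2, 3, 4, 5])` is true because only one interface residue is missing, even though `interface_coverage` for the same input is 75%.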
For further analysis we introduced the parameter q_max, the maximal target sequence coverage in a subgroup of alignments, and counted the number of alignments (all or FIC only) in subgroups corresponding to q ≤ q_max = 40, 50, 60, 70, 80, 90, and 100% (the entire alignment pool). The results in Figure 1 show that even when the target sequence coverage does not exceed 40%, there is a significant number of FIC alignments (191 out of 9,358 alignments with q_max = 40%). Although these FIC alignments constitute ∼2% of alignments with q_max = 40%, they are still sufficient for statistical analysis. The absolute lengths of these alignments range from 32 to 220 residues (for 86 and 631 residue proteins, respectively), covering from 8 to 40 interfacial residues. The quality of the alignments is rather poor (the range of the expectation values is from 2×10^-48 to 1.0, the sequence identities vary from 6.5% to 39%, and the gaps constitute up to 32% of the alignments). Such short alignments are generally considered poor in homology modeling of monomeric proteins. However, they can arguably be used for accurate modeling of protein-protein interfaces if all residues of the target interface are present in the alignment. Such interface modeling would provide accuracy sufficient not only for a meaningful analysis of binding properties, but also for docking of 3D models of monomers. Such docking is important for large-scale modeling of protein-protein complexes because modeling based on homology to co-crystallized protein-protein complexes accounts for only 15-20% of all known interactions [24,25].

Author Summary
Protein-protein interactions play a central role in life processes at the molecular level. The structural information on these interactions is essential for our understanding of these processes and our ability to design drugs to cure diseases. Limitations of experimental techniques to determine the structure of protein-protein complexes leave the vast majority of these complexes to be determined by computational modeling. The modeling is also important for revealing the mechanisms of the complex formation. The 3D modeling of protein complexes (protein docking) relies on the structure of the individual proteins for the prediction of their assembly. Thus the structural accuracy of the individual proteins, which often are models themselves, is critical for the docking. For the docking purposes, the accuracy of the binding sites is obviously essential, whereas the accuracy of the non-binding regions is less critical. In our study, we systematically analyze the accuracy of the binding sites in protein models produced by high-throughput techniques suitable for large-scale (e.g., genome-wide) studies. The results indicate that this accuracy is adequate for the low- to medium-resolution docking of a significant part of known protein-protein complexes.

Identity and Similarity of Interface Alignments
It is important to determine if FIC alignments have properties that distinguish them from the whole pool of alignments. The knowledge of such properties would help in "real" homology modeling where interface residues are not known in advance and only the information related to the alignment properties, such as alignment expectation value e, and/or alignment identity a_iden and similarity a_sim (Eq. 3), is available. For this purpose we compared the distributions of e, a_iden and a_sim for FIC alignments and for all alignments with maximum target sequence coverage q_max (see Figure 2). The results show that e-distributions (data not shown) do not differ significantly between the FIC alignments and all alignments, irrespective of q_max values, with a weak tendency of the FIC alignments to have e values lower than those in the whole pool of alignments. This difference is small and can hardly be used in practical discrimination of the FIC alignments.
The pattern of distributions of other alignment parameters is different (Figure 2). Whereas for the alignments with q_max = 100% there is no large difference between the FIC and all alignments (Figure 2B, D), the FIC alignments with q_max = 40% show a distinguishable difference from all alignments (Figure 2A, C). For example, the part of the FIC alignments with a_iden between 15 and 20% (84 out of 191) is two times larger than for all alignments (2124 out of 9358; Figure 2A). This difference is even more pronounced for the a_sim distributions (Figure 2C), where the part of alignments with a_sim between 15 and 20% is four times larger for the FIC alignments (33 out of 191 as opposed to 459 out of 9358 for all alignments). We can hypothesize that this is due to a larger evolutionary distance between the target and the template proteins in alignments containing only a small part of the target sequence. Binding sites tend to be more conserved than the rest of the surface in evolutionarily related proteins [26]. Such proteins usually correspond to "good" alignments with high target sequence coverage and alignment identity. This assumption is indirectly supported by the distributions of all alignments shown in Figure 2B, D where the fraction of the FIC alignments is larger at higher values of alignment identities and similarities, whereas at lower a_iden and a_sim the situation is opposite.
Figure 3 shows the distributions, similar to those in Figure 2, but only for the residues that belong to the target binding site (these residues do not necessarily form continuous stretches of the protein sequence). To avoid ambiguities in definition of interface identity and similarity (Eq. 4) for the alignments with no or little interface coverage, only FIC alignments are considered. The distributions of interface identity i_iden and similarity i_sim are qualitatively similar to the distributions of a_iden and a_sim. The main difference is the positions of distribution maxima, which are shifted towards smaller values, compared to corresponding maxima positions in the a_iden and a_sim distributions. The largest difference is in the i_iden distribution for the short alignments, with the maximum for i_iden between 5 and 10% as opposed to 15 to 20% for the a_iden distribution. The distributions for the interface residues are also slightly broader than the corresponding distributions for the whole alignments. For example, the peak in a_iden accounts for ∼20% of the alignments while the corresponding peak in the i_iden distribution amounts only to ∼15% of the alignments. This is consistent with the previous assumption that alignments with small target sequence coverage are observed for evolutionarily distant proteins where interface conservation is not evident. It is important to note that there are significant parts of the alignments with no identity in binding site residues (∼6% for the whole pool of FIC alignments in Figure 3B, and ∼15% for the short FIC alignments in Figure 3A) whereas there are no alignments with zero alignment identity overall (Figures 2A, B). This result by itself is not surprising since alignments with no identical aligned residues have expectation values so high that they are considered statistically insignificant and are not included in the PSI-BLAST output. On the other hand, there are no alignments with zero similarity (no similar residues at all) for the short alignments (Figure 3C) and almost no such alignments (<1%) for the whole alignment pool (Figure 3D). This suggests that even for proteins distant in evolution the interface conservation may play some role, although at a more complex level than simple amino acid preservation.

Probability to Find All Interface Residues in an Alignment
For practical modeling of protein complexes it is important to estimate whether the interface residues are inside an alignment based on the alignment properties only. For this purpose we determined the number of FIC alignments within a given range of alignment identities/similarities (with a window of 5%) and the number of all alignments within the same range of identity/similarity values. The ratio of those two numbers gives the probability of finding all interface residues inside an alignment (the FIC alignment probability) with given identity/similarity. The calculations performed for the alignments with q_max ranging from 40% to 100% did not find significant differences in the resulting trends.
For better visualization (lower statistical noise) Figure 4 shows the FIC alignment probability as a function of alignment identity and similarity for the whole alignment pool (q_max = 100%) only. Because of the representative nature of our dataset of complexes, we can argue that the observed trends in this dataset will hold in the general case. Thus, we can assume that for the alignments with identity >40% (similarity >60%), the probability of finding all interface residues in a given alignment is ≥80%. This observation relates to the above suggestion that in the alignments with higher identity/similarity, proteins are closely related evolutionarily. It was demonstrated in previous studies of ion binding proteins [27], mitochondrial carriers [28], glycolytic enzymes [29], cyclin-dependent kinases [30], and other protein families [26,31] that the binding sites in closely related proteins are more conserved than the rest of the surface. Thus, the alignment programs (such as PSI-BLAST used in this study) more reliably identify these highly conserved regions, increasing the chances of having full binding sites inside an alignment irrespective of the alignment length. One can argue that this is a nonessential observation since it is well established in homology modeling of individual proteins that model building from an alignment with identity >40% is a trivial task, since the fraction of correctly aligned residues in such alignments approaches 100% (e.g., see Fig. 1B in Ref. [32]). However, the importance of our finding is that it provides a simple recipe for evaluating the suitability of a particular alignment for building a partial homology model of a protein complex of interest with good accuracy in the interface region.
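The ratio described above can be written as a short binning routine. The sketch below is hypothetical: the function name and the tuple layout (q, identity, is_fic) are assumptions made for illustration, and the handling of an identity of exactly 100% (folded into the last window) is a choice not specified in the text.

```python
from collections import Counter


def fic_probability(alignments, window=5, q_max=100):
    """Estimate P(FIC) per identity (or similarity) window.

    alignments: iterable of (q, identity, is_fic), with q and identity in
    percent. Only alignments with q <= q_max are counted.
    Returns {window_start: n_FIC / n_all} for non-empty windows.
    """
    n_all, n_fic = Counter(), Counter()
    for q, identity, fic in alignments:
        if q > q_max:
            continue
        # Start of the 5% window; 100% is folded into the last window.
        start = min(int(identity // window) * window, 100 - window)
        n_all[start] += 1
        if fic:
            n_fic[start] += 1
    return {start: n_fic[start] / n_all[start] for start in sorted(n_all)}
```

Calling the function with `q_max=40` restricts the estimate to the short alignments, while the default `q_max=100` uses the whole pool.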

Partial Structural Models
As mentioned above, there is a significant number of alignments with low target sequence coverage containing all residues belonging to the interface of the target complex. To assess if such short alignments are useful for structural modeling of protein complexes, we built the structural models and estimated their quality in terms of interface RMSD between the model and the native structures (see Methods) for all FIC alignments with a certain maximum target sequence coverage q_max. To avoid ambiguities caused by possible absence of parts or even all of the interface residues in partial models, the study is restricted to FIC alignments and RMSD of the binding site atoms. We also focused on the extreme case of q_max = 40%, although modeling was performed for the alignments with q_max = 50% and 60% as well, with results being qualitatively similar to those for q_max = 40%. Among the alignments considered, there were no cases for direct homology modeling where sequences of monomers in the target complex are aligned with the sequences from a template complex. The identities of aligned sequence parts in the alignments used to build the models in all cases were well below 40%, which puts them in the "twilight" zone of homology modeling of protein complexes [13].
There were 191 FIC alignments with q_max = 40% for 26 target sequences, among which two were from antibody-antigen complexes, three from enzyme-inhibitor complexes, and the rest from the "other" functional group. This distribution shows no overrepresentation of functional groups compared to the entire dataset. Models were built for all 191 alignments. However, for further analysis we chose a single model per target sequence, based on the highest identity of aligned sequence parts (top model). The results are presented in Table 3. For seven target complexes (∼27%) the top model had interface RMSD<5 Å, which is in line with the estimates of the binding site accuracy needed for meaningful docking predictions [19]. For five complexes, interface RMSD was between 5 Å and 10 Å, which, according to the estimates of the docking funnel size [33], can produce near-native matches. Thus we define them as acceptable accuracy models of the monomers (not to be confused with the acceptable accuracy models of the complexes in the CAPRI evaluation, http://www.ebi.ac.uk/msd-srv/capri). The FIC alignments were detected in 50% of the complexes with overall alignments considered unsuitable for homology modeling of monomeric proteins. Interestingly, the expectation value of the alignment does not appear to be an appropriate parameter to assess the quality of the resulting model, since in no case did the alignment for the best model have the lowest e-value among the FIC alignments, although the lowest e-value observed for the top models' alignments was 10^-47 (1gxd, chain A). For 17 target sequences, the top model was found to be also the best model, i.e., the model with the lowest interface RMSD. Among the 9 cases with different top and best models, only in two cases were the interface RMSD values significantly different (the top and the best models in different quality categories; data shown in Table 3 in bold).
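As an illustration of the quality measure, the following minimal sketch computes an RMSD over the Cα atoms of the interface residues. It assumes that the model and the native structure are already expressed in a common coordinate frame and that residues are matched by identifier; the superposition protocol actually used is described in Methods and is not reproduced here. The function name and input layout are hypothetical.

```python
import math


def interface_rmsd(model_ca, native_ca, interface_residues):
    """RMSD (Å) over Cα atoms of the interface residues.

    model_ca, native_ca: dicts mapping residue id -> (x, y, z) of the Cα
    atom, assumed to be in the same coordinate frame (no fitting is done).
    """
    squared = [
        sum((m - n) ** 2 for m, n in zip(model_ca[res], native_ca[res]))
        for res in interface_residues
    ]
    if not squared:
        raise ValueError("no interface residues given")
    return math.sqrt(sum(squared) / len(squared))
```

Because the study is restricted to FIC alignments, every interface residue is expected to be present in the partial model, which is why the sketch does not handle missing residues.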
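The ranking scheme and the quality categories used above can be summarized in code. This is a sketch under stated assumptions: the dictionary keys and function names are hypothetical, the label "good" follows the later reference to good-accuracy models, and the treatment of values exactly equal to 5 Å or 10 Å is a choice not fixed by the text.

```python
def quality_category(interface_rmsd):
    """Map interface RMSD (Å) to the categories used in the text."""
    if interface_rmsd < 5.0:
        return "good"        # suitable for low-resolution template-free docking
    if interface_rmsd <= 10.0:
        return "acceptable"  # within the estimated docking funnel size
    return "wrong"


def top_and_best(models):
    """Select the top model (highest alignment identity) and the best model
    (lowest interface RMSD) for one target sequence.

    models: list of dicts with at least 'identity' (%) and 'rmsd' (Å).
    """
    top = max(models, key=lambda m: m["identity"])
    best = min(models, key=lambda m: m["rmsd"])
    return top, best
```

In practical modeling only the top model is available, since the native structure, and hence the RMSD, is unknown; comparing `quality_category` for the top and best models shows how much is lost by ranking on identity alone.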
The data in Table 3 indicate that all FIC alignments for the top models have low sequence and interface identity/similarity, which suggests that target and template proteins in those alignments are evolutionarily remote (see discussion in previous sections). Thus, it is interesting to analyze whether there is a preference of target and template proteins in alignments to be from the same organism or from different species. Our analysis suggests no such preference since for good and acceptable models there were 6 target-template pairs from the same organism and 9 pairs from different organisms (corresponding numbers for the wrong models are 5 and 8). This does not support a conclusion from an earlier study [34] that protein-protein interactions are more conserved within one species than across the species. However, a statistical analysis on a much larger pool of data is needed to reach a more definite assessment (work currently in progress). Figure 5 shows examples of the models, including those for which the target and the template sequences are from the same and from different organisms. One interesting similarity in both cases (Figures 5A and 5B) is that the target proteins have two clearly distinguishable domains and the model structure covers a significant portion of one of the domains, which exclusively participates in the interaction with the other monomer (not shown for clarity). In fact, this feature is common to all good-accuracy models (interface RMSD<5 Å). The data on the binding domain coverage is provided in Table 3 (where applicable). It shows that there is no clear correlation between the binding domain coverage (although it is higher than the entire sequence coverage) and the model quality. Acceptable accuracy models are built for the single-domain proteins as well. Figure 5C shows an example of such a model. The implication for practical modeling is that if the target protein is predicted to have a domain structure, then it is likely that the accuracy of the homology models produced on the basis of the "bad" alignments will be sufficient to perform a meaningful template-free docking. On the other hand, for homology models of single-domain proteins, methods less sensitive to structural inaccuracies (e.g., structural alignment) should be used.

Table 3. Parameters of the top models produced on the basis of alignments with maximum 40% target sequence coverage and full interface coverage.
Columns: Target [PDB and chain ID (1); Source organism (2); Biological function (3)]; Template [PDB and chain ID (1); Source organism (2); Biological function (3)]; Log e (4); q, % (5); q_dom, % (6); Alignment identity and similarity (7); Interface identity and similarity (8); Interface RMSD, Å (9).
(1) First four symbols are the PDB code followed by ID of the chain as in the PDB file. Asterisk indicates that protein is a monomer in the PDB file.
(2) As provided in PDB file. Letters in parentheses stand for higher levels of taxonomy classification (V: viruses; A: archaea; B: bacteria; F: fungi; P: plants; M: mammals).
(3) Extracted from PDB GO terms section.
(4) Logarithm of alignment expectation value (e-value).
(5) Entire target sequence coverage in the alignment of the model, as defined by Eq. 1.
(6) Coverage of the target binding domain (for multi-domain structures) in the alignment of the model.
(7) As defined by Eq. 3.
(8) As defined by Eq. 4.
(9) RMSD between Cα atoms of the interface residues in the model and the native structure.
This assessment is supported by a comprehensive study of the ability of template-free docking to tolerate structural inaccuracies [19], which showed that low-resolution structural features of protein-protein interactions can be determined for a significant percentage of complexes of highly inaccurate protein models (typically up to 6 Å RMSD from the native structure of the monomer). The results were further supported by recent studies of antibody-antigen docking of homology models, which concluded that the homology models yield medium-to-high quality of docking predictions [35]. Further confirmation came in the recent study by Aloy et al. [36] on the structural modeling of the yeast interactome, where it was found that the use of homology models in docking does not lead to a critical loss of accuracy (assessed by extrapolation of docking results for the unbound X-ray structures).
Figure 5. [...] Table 3. doi:10.1371/journal.pcbi.1000727.g005

Our preliminary benchmarking of the template-free docking of the modeled structures was performed using the GRAMM procedure, in keeping with the goal of this study, in a high-throughput fashion that does not involve computationally expensive scoring and structural refinement. The low-resolution criterion for success was a match with the ligand interface RMSD<8 Å in the top 100 predictions. This RMSD value corresponds to the characteristic size of the binding funnel [33]. Such low-resolution predictions from the coarse-grained global scan are located within the binding funnel and can be further locally refined within the funnel. Higher-resolution docking, and the correspondingly stricter success criteria (such as those used in CAPRI), in addition to longer computational times, require higher, non-high-throughput accuracy of the binding site modeling, which is outside the scope of this study. The current study is aimed at the models of poor quality that still preserve the acceptable accuracy of the binding site. According to the above criterion, the success rate for the modeled proteins dropped to 23% from the similarly obtained 43% for the unbound X-ray proteins. However, such a success rate is still significant for genome-wide studies. A systematic assessment of docking application to modeled structures of different accuracy is currently in progress. Table 3 also includes data on the failed modeling (interface RMSD>10 Å). Figure 5D shows an example of such a model. The target native structure has a domain structure similar to that of the successful models described above. The main reason for the incorrect modeling of the interface region is the presence of a long stretch of gaps on the template side in the alignment. This is the reason for the incorrect loop (indicated by arrow in Figure 5D), modeled without a template in the vicinity of the interface, which resulted in a position shift of the interface residues in the model compared to the native structure (yellow and blue meshes in Figure 5D). Another typical reason for large interface RMSD is the native structure interface having no secondary structure elements (e.g., a loop in enzyme-inhibitor complexes), while the fragment is modeled on a template with distinct secondary structure elements. A large difference between the quaternary structures of the native target and the template also may lead to a large shift of interface residues in the model, even if these residues belong to the same secondary structure elements as in the native structure.
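The low-resolution success criterion used in the docking benchmark above (a match with ligand interface RMSD<8 Å among the top 100 predictions) can be expressed as follows. The function names and the input layout are assumptions for illustration; the sketch does not call GRAMM and only evaluates a ranked list of RMSD values that is assumed to be already available.

```python
def is_docking_success(ranked_rmsd, top_n=100, cutoff=8.0):
    """True if any of the top_n ranked predictions has ligand interface
    RMSD (Å) below the cutoff.

    ranked_rmsd: list of RMSD values ordered by docking rank, best first.
    """
    return any(rmsd < cutoff for rmsd in ranked_rmsd[:top_n])


def success_rate(cases, top_n=100, cutoff=8.0):
    """Percent of complexes whose ranked predictions meet the criterion.

    cases: list of ranked RMSD lists, one per complex.
    """
    if not cases:
        raise ValueError("no docking cases given")
    hits = sum(is_docking_success(c, top_n, cutoff) for c in cases)
    return 100.0 * hits / len(cases)
```

The cutoff and the number of predictions considered are exposed as parameters so that stricter criteria, such as those used in CAPRI, could be substituted without changing the logic.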
Analysis of organism and functional annotations (Table 3) revealed that both target and template proteins are from species spanning the entire universe of life (viruses, archaea, bacteria, and lower (fungi) and higher (plants and mammals) eukaryotes) and participate in a broad range of biochemical processes. Moreover, there is no clear correlation between source organisms of the target-template pair or the biochemical pathways in which they participate. There are correct models with the target and the template from evolutionarily distant organisms (e.g., mammals and archaea), as well as incorrect models with the target and the template from evolutionarily close organisms or even the same organism. Similarly, no such correlation was found for the functions of the target and the template proteins, although the functional assignment has limited reliability. This suggests that the current ability to model complexes may not be restricted to certain species and/or functions. However, statistical analysis of a much larger protein interactions dataset, when it becomes available, would be necessary to draw more definite conclusions.

Concluding Remarks
For systematic evaluation of potential accuracy in high-throughput modeling of binding sites, local sequence alignments were performed in a representative set of protein-protein complexes. The results indicate that for the majority (97%) of the target sequences there is at least one alignment containing all residues belonging to the interface of the target complex (FIC alignments). A significant number of FIC alignments were observed even when only ∼40% of the target sequence was aligned against the template. The results suggest a simple graphical function for evaluating the probability of finding all interface residues inside a local alignment when only the alignment information is known.
Homology models of the interfaces in target monomers were built based on the FIC alignments with query target sequence coverage of ∼40%. A simple scheme of model ranking based on the alignment identity showed that in ∼50% of cases the structural models have accuracy high enough for protein docking. Alignments that contain only a small portion of the target sequence and have low sequence identity are usually considered poor in modeling of individual proteins. They are used primarily in elaborate and computationally expensive techniques that are hardly applicable on a genome-wide scale. Our results suggest that for the genome-wide structural modeling of protein interactions, simpler and less computationally expensive techniques based on a single local sequence alignment may yield satisfactory results, provided that the interface residues are reliably identified in the alignment. Current methods for predicting protein-protein binding sites based on sequence information alone have limited accuracy (e.g., Refs. [37,38]). However, because of the ongoing significant community efforts in this direction, one may expect the emergence of more accurate methods in the near future.
A straightforward template-based modeling of protein complexes is possible on the basis of a co-crystallized template complex. However, previous studies [24,25] demonstrated that this technique could account for only ∼15-20% of all known interactions, whereas the rest of the protein complexes have to be modeled by other techniques. One possible direction is independent modeling of individual monomers on different templates, followed by docking of these models (either template free or based on structure alignment). Earlier studies (e.g., Refs. [19,35,39] and others), as well as the results of this work, suggest the feasibility of this scenario. However, more systematic and comprehensive studies are needed to establish quantitative guidelines for the applicability of homology models in large-scale structural modeling of protein-protein interactions (such a study is currently in progress).

Set of Proteins
Hetero-complexes with known 3D structures available in PDB were used in the study. To avoid bias caused by overrepresentation of certain protein families in PDB, we used the representative set of protein complexes from the DOCKGROUND resource [21], manually selected and purged at the 30% sequence identity level. Out of 523 complexes in the dataset, we further excluded structures with multi-chain interactions and those with large structural defects in the vicinity of the interface, which allowed us to avoid ambiguities in determining binding site residues. The final set consisted of 329 two-chain non-obligate complexes shown in Table 1 (63 enzyme-inhibitor, 12 antibody-antigen, 25 cytokine receptors, and 229 other complexes). This set is based on all protein structures available in PDB; thus the results are not dataset-dependent.

Software
For the 658 sequences in the dataset, the search for sequence homologues was performed by PSI-BLAST [22] implemented in the program BLASTPGP. To broaden the pool of potential templates, the maximum number of hits was set to 2000, with all other parameters set to default values. To obtain the checkpoint file (the position-specific scoring matrix, PSSM) [22], the search was performed against all sequences in the non-redundant database of sequences (www.ncbi.nlm.nih.gov) with the substitution matrix BLOSUM62 [40] and five iterations. The checkpoint file was then used in a subsequent PSI-BLAST run against all non-redundant sequences in PDB.
The 3D models from the PSI-BLAST sequence alignments were built by the program NEST from the JACKAL package developed in Honig's lab [23], using default parameters. Large errors in some template files were repaired by the program PROFIX from the same package. The NEST program was chosen over other popular modeling programs because it yields reliable models fast enough to be used in large-scale calculations (e.g., according to benchmarking of various homology modeling programs [41]) and can be easily incorporated into automatic scripts for generating and updating the databases of structural models currently under development in the lab.
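The two-stage search described above can be sketched as a pair of command lines. This is only an illustration under stated assumptions, not the authors' actual script: the database names `nr` and `pdbaa`, the file-naming scheme, and the choice to apply the 2000-hit limit to the PDB search are hypothetical. The flags are those of the legacy NCBI `blastpgp` program (`-j` iterations, `-C` write checkpoint, `-R` restart from checkpoint, `-v`/`-b` numbers of reported descriptions and alignments).

```python
def psiblast_commands(query_fasta, nr_db="nr", pdb_db="pdbaa",
                      max_hits=2000, iterations=5):
    """Build argument lists for a two-stage blastpgp search.

    Stage 1 builds a PSSM checkpoint against a non-redundant sequence
    database; stage 2 restarts from that checkpoint against PDB sequences.
    Database names and output file names are assumptions for illustration.
    """
    checkpoint = query_fasta + ".chk"
    build_pssm = [
        "blastpgp",
        "-i", query_fasta,        # query sequence (FASTA)
        "-d", nr_db,              # non-redundant sequence database
        "-j", str(iterations),    # number of PSI-BLAST iterations
        "-M", "BLOSUM62",         # substitution matrix
        "-C", checkpoint,         # write the checkpoint (PSSM) file
        "-o", query_fasta + ".nr.out",
    ]
    search_pdb = [
        "blastpgp",
        "-i", query_fasta,
        "-d", pdb_db,             # sequences of PDB entries
        "-R", checkpoint,         # restart from the stage-1 checkpoint
        "-v", str(max_hits),      # one-line descriptions to report
        "-b", str(max_hits),      # alignments to report
        "-o", query_fasta + ".pdb.out",
    ]
    return build_pssm, search_pdb


build_cmd, search_cmd = psiblast_commands("target.fasta")
print(" ".join(build_cmd))
print(" ".join(search_cmd))
```

Each list can be passed to `subprocess.run` on a machine where the legacy BLAST binaries and formatted databases are installed.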

Analysis of Results
Since sequence alignments produced by PSI-BLAST are local by design [22], not all residues of the target sequence are present in the alignment. Thus, for the analysis of the alignments we defined the target sequence coverage

$$q_{seq} = \frac{N_{ali}}{N_{tot}} \qquad (1)$$

and, similarly, the interface coverage

$$q_{inter} = \frac{N^{inter}_{ali}}{N^{inter}_{tot}} \qquad (2)$$

where $N_{ali}$ and $N^{inter}_{ali}$ are the numbers of all target residues and the target interface residues, respectively, in the alignment; $N_{tot}$ and $N^{inter}_{tot}$ are the total numbers of all residues and the interface residues, correspondingly, in the entire target sequence. We did not analyze whether the template is multimeric or monomeric (although the data are available in Table 3), since our goal was to determine the general usefulness of short sequence alignments in binding site modeling, rather than traditional homology modeling of protein complexes where both target and template are multimers. When the target had a multi-domain structure, we also calculated the domain coverage $q_{dom}$ using formula (1), where $N_{ali}$ is the total number of the target residues inside the binding domain.
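The coverage definitions above reduce to simple set arithmetic over residue numbers. The following is a minimal sketch, assuming residues are identified by integer positions in the target sequence; the function name `coverage` and the example residue numbers are hypothetical and are not taken from the study.

```python
def coverage(aligned_positions, total_positions):
    """Fraction of `total_positions` that also occur in `aligned_positions`."""
    total = set(total_positions)
    if not total:
        raise ValueError("total_positions must not be empty")
    return len(total & set(aligned_positions)) / len(total)


# Hypothetical 10-residue target; the local alignment covers residues 3..8.
target_residues = range(1, 11)
aligned = range(3, 9)
interface = [4, 5, 7, 9]          # residue 9 lies outside the alignment

q_seq = coverage(aligned, target_residues)    # aligned / all target residues
q_inter = coverage(aligned, interface)        # aligned / all interface residues
is_fic = q_inter == 1.0                       # full interface coverage?

print(q_seq, q_inter, is_fic)
```

In this toy case the alignment is not an FIC alignment, because one interface residue falls outside the aligned region.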
The alignments were further analyzed with respect to the alignment e-value as well as their identity and similarity, defined as

$$I = \frac{N_{iden}}{L_{ali}} \qquad (3)$$

$$S = \frac{N_{pos}}{L_{ali}} \qquad (4)$$

where $L_{ali}$ is the length of the alignment (number of target residues in an alignment plus gaps in the aligned target sequence), $N_{iden}$ is the number of aligned identical residue pairs, and $N_{pos}$ is the number of aligned residue pairs for which the substitution matrix displays a positive number (evolutionarily favorable substitutions).
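As an illustration of these two quantities, the sketch below computes them from the aligned target string and the midline of a BLAST-style text alignment. It assumes the usual BLAST convention that the midline repeats the residue letter for an identical pair and shows `+` for a non-identical pair with a positive substitution score, so that positives include identities. The function name and the toy alignment are hypothetical.

```python
def identity_similarity(aligned_target, midline):
    """Return (identity, similarity) for one local alignment.

    aligned_target: target sequence as printed in the alignment, with '-'
                    marking gaps; its length is taken as the alignment length.
    midline:        BLAST-style midline of the same length.
    """
    if len(aligned_target) != len(midline):
        raise ValueError("aligned target and midline must have equal length")
    l_ali = len(aligned_target)
    if l_ali == 0:
        raise ValueError("empty alignment")
    # Identical pairs: the midline repeats the target residue letter.
    n_iden = sum(1 for t, m in zip(aligned_target, midline)
                 if t != "-" and m == t)
    # Positive pairs: identities plus '+' positions.
    n_pos = n_iden + midline.count("+")
    return n_iden / l_ali, n_pos / l_ali


# Toy alignment: 3 identities, 1 conservative substitution, 1 gap, length 7.
identity, similarity = identity_similarity("MKT-LVA", "MK  +V ")
print(identity, similarity)
```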
Similarly, the identity and similarity of the interface residues inside an alignment were defined as

$$I^{inter} = \frac{N^{inter}_{iden}}{N^{inter}_{ali}} \qquad (5)$$

$$S^{inter} = \frac{N^{inter}_{pos}}{N^{inter}_{ali}} \qquad (6)$$

where $N^{inter}_{iden}$ ($N^{inter}_{pos}$) is the number of aligned identical (positive) residue pairs in which the residue on the target side belongs to the target complex interface, and $N^{inter}_{ali}$ is the total number of the interface target residues in the alignment.

To evaluate the quality of the resulting homology model, we calculated the root-mean-square distance between the Cα atoms of the interface residues (interface RMSD), with the native structure of the monomer and its model superimposed by the program TM-align [42]. This measure is different from the RMSD used in the CAPRI evaluation [5], where it is calculated between the interface atoms of the ligand in the native and in the docked matches, after structural superimposition of the receptors. Other widely used modeling quality criteria, such as sensitivity and specificity, are not applicable to our study because they involve true and false-positive/negative predictions that can be defined either for binary predictions of the fact of protein interactions (which is not the case in our study) or in the case of a fully modeled complex structure with both monomers present.
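The interface RMSD described above can be sketched as follows. The example assumes that the superposition has already been done (in the study, by TM-align) and that Cα coordinates are stored in dictionaries keyed by residue number; the function name, the data layout, and the coordinates are hypothetical.

```python
import math


def interface_rmsd(native_ca, model_ca, interface_residues):
    """RMSD over C-alpha atoms of interface residues.

    native_ca, model_ca: dicts mapping residue number -> (x, y, z) in
                         angstroms, already in a common (superimposed) frame.
    Only interface residues present in both structures are used.
    """
    shared = [r for r in interface_residues
              if r in native_ca and r in model_ca]
    if not shared:
        raise ValueError("no interface residues shared by native and model")
    sum_sq = sum(math.dist(native_ca[r], model_ca[r]) ** 2 for r in shared)
    return math.sqrt(sum_sq / len(shared))


# Toy coordinates: interface residues 1 and 2 are displaced by 3 and 4
# angstroms; residue 3 is not part of the interface and is ignored.
native = {1: (0.0, 0.0, 0.0), 2: (1.0, 0.0, 0.0), 3: (5.0, 5.0, 5.0)}
model = {1: (0.0, 0.0, 3.0), 2: (1.0, 4.0, 0.0), 3: (9.0, 9.0, 9.0)}

rmsd = interface_rmsd(native, model, interface_residues=[1, 2])
print(round(rmsd, 3))
```

Restricting the sum to interface residues is what distinguishes this measure from a whole-chain Cα RMSD: a model can have a poor overall fit yet a well-placed binding site, or the reverse.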