Identification of Local Conformational Similarity in Structurally Variable Regions of Homologous Proteins Using Protein Blocks

Structure comparison tools can be used to align related protein structures to identify structurally conserved and variable regions and to infer functional and evolutionary relationships. While the conserved regions often superimpose well, the variable regions appear non-superimposable. Differences between homologous protein structures are thought to arise from evolutionary plasticity that accommodates diverged sequences. One kind of difference between 3-D structures of homologous proteins is rigid body displacement. A striking example is a pair of equivalent α-helical regions in homologous proteins that have different spatial orientations and therefore do not superimpose well. In a rigid body superimposition, these regions would appear variable although they may contain local similarity. Also, due to high spatial deviation in the variable region, one-to-one correspondence at the residue level cannot be determined accurately. Another kind of difference is conformational variability, the most common example being topologically equivalent loops of two homologues with different conformations. In the current study, we present a refined view of the “structurally variable” regions, which may contain local similarity obscured in a global alignment of homologous protein structures. As a structural alphabet can describe local structures of proteins precisely through the Protein Blocks approach, conformational similarity has been identified in a substantial number of ‘variable’ regions in a large data set of protein structural alignments; optimal residue-residue equivalences could be achieved on the basis of Protein Blocks, which led to improved local alignments. Also, through an example, we have demonstrated how the additional information on local backbone structures through Protein Blocks can aid in comparative modeling of a loop region. In addition, understanding of sequence-structure relationships can be enhanced through our approach. This has been illustrated through examples where the equivalent regions in homologous protein structures share sequence similarity to varied extents but do not preserve local structure.


Introduction
Comparison of protein structures is an indispensable step in understanding structure-function relationships. In most cases, reasonable first impressions of the function of a protein can be formed if the protein shares high structural similarity with a protein of known function [1]. It also gives hints on evolutionary relationships [2][3][4][5]. During evolution, the folds of homologous proteins are conserved even without detectable sequence similarity [6,7]. High structural similarity associated with very low sequence similarity is indicative of either a common origin [4,6] or an independent origin with convergence to a common fold [8]. During the evolutionary process, different regions of proteins are constrained differently; the regions critical for functional and structural integrity are well preserved, while the rest of the structure can diversify to accommodate insertions, deletions and substitutions [9,10].
The 3-D superimposition of protein structures obtained by using structure comparison tools is very useful in quantifying structural dissimilarity and in analyzing structural divergence. Structural comparison influences classification of proteins into protein families, superfamilies etc [11,12], i.e., they allow a complete representation of protein fold space. Hence, for a newly determined protein structure, mining the structural databases enables the identification of protein structures/sub-structures similar to the given structure [13][14][15][16][17][18]. Proteins are not rigid macromolecules and they exhibit certain degree of flexibility to allow structural variations critical for functional mechanisms [19]. Thus comparison of structures corresponding to the active and inactive states of a protein can further our understanding on the conformational plasticity of protein structures and the insights gained can improve the drug design process [20][21][22].
Alignment of proteins on the basis of their 3-D structures is more complex than sequence-based alignment, as the 3-D structural information is more complex [23]. From a computational point of view, identifying the best match having the least spatial distance between the maximum number of equivalent regions is highly expensive. Heuristics are usually added to make the problem tractable. The difficulty in aligning structures is compounded when the structures share similar secondary structures with different connectivity. In such cases, matching of equivalent regions is not sequential. Owing to the utility of structure alignment and the difficulties mentioned above, numerous methods to compare and align protein structures have been developed, e.g., DALI [14], SSAP [24], MAMMOTH [25], CE [26], COMPARER [27], FATCAT [28] and Matt [29]. These methods seek to find maximum correspondences between the structural elements (i.e., atoms, residues or secondary structures), and compute a similarity measure. They differ at the level of (i) representation of protein structure (points, vectors, internal distances or graphs), (ii) measure of similarity and (iii) the algorithm used for comparison (for reviews see [30][31][32]). The comparison algorithms are varied and include dynamic programming, stochastic algorithms such as Monte Carlo methods, and graph theory based methods.
The algorithms can be grouped into rigid body methods, which view protein structures as rigid bodies, e.g., STAMP [33], DALI [14], CE [26] and MAMMOTH [25], and flexible methods, which connect series of aligned fragments or substructures, e.g., FATCAT [28], FlexProt [34] and Matt [29]. The structural alignments provided by flexible methods are believed to be better as they are biologically more meaningful [35].
In this work, we attempt to add a component of flexible alignment in local variable regions that are initially recognized by rigid body superposition. Here, we focus on the "structurally variable" (high spatial deviation) regions in the alignments of three-dimensional structures of homologous protein domains in the PALI database [36]. The PALI database comprises protein families from SCOP [11]. It contains structural alignments generated using DALI [14], a well established structure comparison method, and subsequently superimposed using a rigid body alignment method. After such a rigid body superposition, the backbone regions with highly similar structures are evident from the good overlap of Cα atoms. The structural differences in homologous proteins could be due to structural re-ordering to accommodate mutations. These differences vary from subtle variations in backbone structure to large orientation differences to accommodate substitutions, especially at the core [6,37,38,39]. Insertions are accommodated as an extension to the existing secondary structures or as an addition of new regular/irregular structures [37,38,40]. These insertions may either act as embellishments or promote functional diversity by presenting an altered/new binding site for a ligand or macromolecule [38,40].
The current study pertains to those backbone regions of homologous proteins that are not well superimposed in the rigid body superposition. These structural differences between homologues can be categorized into rigid body displacements and conformational variations. However, due to rigid body displacements, an optimal superposition may not be obtained using a rigid body superposition method. Through Protein Blocks [41], a simplified representation of protein structures, we classify the variable regions into conformationally dissimilar regions and regions that share local structural similarity obscured in a global fit. In the next step, we refine the alignment between homologues in the PALI database, obtained through a recognized rigid body method, using the match of Protein Blocks in the local structurally variable regions. Additionally, depending on the similarity measure used, the assignment of residue-residue equivalences for a structural superposition may differ [23]. The discrepancies are higher when the Cα-Cα deviation is high. An optimal local alignment would help in assigning residue-residue equivalences more precisely.
For this work, Protein Blocks (PBs) [41][42][43][44] are the major tool used. They represent a higher level abstraction of protein backbone conformation. PBs are a set of 16 prototype conformers, denoted a to p, which approximate the local protein structure with an average root mean square deviation of 0.42 Å. Protein Blocks have been used in comparison of protein structures [41,45] and database mining [46]. PBs have been found to be useful in prediction of short loops [47]. The Protein Blocks approach has also been used to build trans-membrane protein structures [42], to design peptides [48], to define reduced alphabets for designing mutants [49], to analyze protein contacts [50], to find structural motifs across protein families [18] and to identify Mg2+ binding sites in proteins [51].

Results and Discussion
Superimposed proteins from the PALI database have regions of correspondence that exhibit high structural deviation, namely "Structurally Variable Regions" (SVRs). These regions may appear "structurally variable" (not well superimposed) in a global context but may exhibit local conformational similarity. For example, an α-helical region in a protein might correspond to an α-helical region in the homologue; however, if the helical regions in the two proteins are in slightly different orientations, they may not appear superimposed when the two structures are superimposed as a whole. Using a PB Substitution Matrix (SM) coupled with the CLUSTALW [52] alignment approach, SVRs were re-aligned to seek an improvement in the local alignment for these regions (see Materials and Methods section). We investigated the differences between the alignments obtained after employing the Protein Blocks approach (aSVRs; "a" stands for "after") and the alignments before employing the approach (bSVRs; "b" stands for "before"), to evaluate our protocol in revealing similarities not identified using a global rigid-body superposition method. For this purpose, we compared the two alignments, referred to as bSVRs and aSVRs in the rest of this paper, based on PB scores and values of root mean square deviation (rmsd) or a similar measure, the Structural Distance Metric (SDM). An improvement in the values of these two parameters for aSVRs would reflect an improvement in the alignment obtained using PBs. A total of 347,062 Structurally Variable Regions (SVRs) and 542,610 Structurally Conserved Regions (SCRs) were identified in the PALI database (refer to Materials and Methods).

Distribution of scores
Re-alignment of PB sequences of SVRs changes the alignment scores. Figure 1 shows the distribution of the normalized score for aligned pairs (SAP score) obtained for SCRs, bSVRs and aSVRs. A normal distribution of scores was observed. Using the two-sided Kolmogorov-Smirnov statistic, the p value for each of the three distributions is less than 2.2e-16. As expected, the values of SAP are higher in SCRs than in bSVRs, indicating higher structural similarity in SCRs. However, compared to bSVRs, a significant shift of SAP values towards higher scores was observed for aSVRs (p value < 2.2e-16; paired Student's t test, see Figure 1). An analysis of the difference in SAP values between aSVRs and bSVRs indicates an improvement for 56% of SVRs and a decrease for 13% of SVRs. The scores remain unchanged for the remaining 31% of the alignments. The trend for the distribution of scores for complete alignment (SCA) was similar to that of SAP scores; 59% of aSVRs scored higher and 14% scored lower than bSVRs (see Text S1 and Figure S1). SCA and SAP scores are reasonably correlated for both bSVRs and aSVRs. A shift could be observed towards higher scores; 55% of aSVRs scored above -1 for SCA and SAP measurements as opposed to 41% of bSVRs, i.e., an increase of 14 percentage points (see Text S2 and Figure S2). Thus, an increase in the scores after re-alignment indicates that the regions concerned are more similar in terms of conformation than previously represented in the PALI database. Both measurements show improvement, indicating better alignment of SVRs.
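As an illustration of how a normalized score for aligned pairs might be computed, the following minimal Python sketch averages substitution scores over the alignment columns in which a PB is aligned with a PB. The normalization by the number of aligned pairs, the function names and the toy scoring rule are assumptions made for illustration only; the actual PB substitution matrix and the exact score definition used in this study are those described in the Materials and Methods section.

```python
def toy_pb_score(pb_a, pb_b):
    """Hypothetical stand-in for a PB substitution matrix lookup."""
    return 2.0 if pb_a == pb_b else -1.0


def sap_score(row_a, row_b, score=toy_pb_score, gap="-"):
    """Average substitution score over PB-PB columns (assumed normalization)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    # Keep only columns where both rows carry a PB (no gap on either side).
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        return None
    return sum(score(a, b) for a, b in pairs) / len(pairs)


# Toy PB strings: the same helix start (k, l, m...) out of and in register.
before = sap_score("fklmmm", "klmmmn")
after = sap_score("fklmmm-", "-klmmmn")
print(before, after)
```

In this toy case the re-aligned rows pair only identical PBs, so the averaged score rises, which mirrors the kind of shift towards higher scores reported for aSVRs.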
Analysis of the distribution of SCA and SAP allowed us to define a cutoff score of -0.42 to distinguish conformationally similar SVRs from dissimilar ones (see Figure 1). The cutoff was chosen such that 90% of the scores corresponding to the structurally conserved regions lie above this threshold. Based on this cutoff, 53% of the bSVRs and 74% of the aSVRs were classified as conformationally similar, i.e., an increase of 21 percentage points (45,343 SVR segments). Thus, through our approach we have been able to identify local structural similarity in a substantial number of SVRs, which was not known from the classical approach.
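The cutoff rule described above (a threshold that 90% of the SCR scores reach) can be sketched in a few lines of Python. The function names and the toy score lists are hypothetical, and the sketch treats a score equal to the threshold as passing it, which is an assumption.

```python
def cutoff_for_percent_above(scores, percent_above=90):
    """Largest threshold such that at least percent_above % of scores are >= it."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores, reverse=True)
    # Integer ceiling of percent_above * n / 100 avoids floating point rounding.
    k = (percent_above * len(ordered) + 99) // 100
    return ordered[k - 1]


def fraction_similar(scores, cutoff):
    """Fraction of segments whose score reaches the cutoff."""
    return sum(1 for s in scores if s >= cutoff) / len(scores)


# Toy values only; they are not taken from the PALI data set.
scr_scores = [-1.0, -0.5, 0.1, 0.4, 0.8, 1.2, 1.5, 1.9, 2.3, 2.7]
cutoff = cutoff_for_percent_above(scr_scores)
svr_scores = [-2.0, -0.6, -0.5, 0.3]
print(cutoff, fraction_similar(svr_scores, cutoff))
```

Applying the same threshold to the scores of bSVRs and aSVRs separately would give the two fractions of conformationally similar segments that are compared in the text.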

PB substitutions in bSVRs and aSVRs
An improvement of scores is observed after re-alignment. This increase is due to a higher number of PB-PB equivalences (i.e., the number of PBs aligned with another PB and not a gap), and/or a change in the nature of the PBs aligned at various positions in the alignment. 76% of SVRs showed no change in the raw number of correspondences after re-alignment. On average, for each segment, 0.5 more PBs are aligned with a PB in aSVRs than in bSVRs (see Figure S3). Figure 2 shows the difference in the distribution of PB-PB substitutions between bSVRs and aSVRs. Alignment of identical PBs (i.e., diagonal elements of the plot) is increased for each PB. Among the non-identical substitutions, the highest increase has been observed for the alignments of PB f (C cap of β-strand) and PBs k and l (loop to N cap of α-helix). A lower increase was observed in the alignments of PBs a and c (N cap of β-strand), PB d (β-strand), PB m (α-helix), PBs n to p (C cap of α-helix) and PB h (loop). The alignments of PB m with every PB type except itself show a drop, the highest decrease being in the alignments with PBs d, f, k, l and n. A lower decrease was observed for the alignment of PBs a and c with PBs corresponding to loops, PB d with PBs corresponding to loops and capping regions of α-helices, PB f (C cap of β-strand) with PBs corresponding to the N cap of β-strands and loops, PB k with PBs corresponding to the N and C caps of β-strands, and PB l with PBs c, e, k and p. In general, this decrease concerns unrelated or dissimilar PBs, and the increase is mainly observed in highly similar or identical PBs. The increase in the number of equivalences for PBs corresponding to the N cap and C cap regions of helices and strands, as seen in Figure 2, suggests an improved alignment of these regions. Similar conclusions were drawn from the plots generated for data sets corresponding to various SCOP classes.
Hence, the major contributing factor for the increase in scores is the change in the type of equivalences rather than an increase in the number of correspondences. In fact, only 30% of SVRs share more than 95% of the equivalences. In the remaining 70% of SVRs (see Figure S4A), the equivalences are shared to varied extents. Nevertheless, a common PB pair found in the two alignments could in fact come from different regions in the sequences. 40% of SVRs have undergone changes in the alignments to form new equivalences although the PB-PB equivalences are preserved, while for 40% of SVRs the equivalences are retained in the alignment (see Figure S4B).
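The distinction drawn above, between the same PB pair occurring in both alignments and the same positions being paired, can be made concrete with a small Python sketch. The representation of an alignment as two gapped strings, the function names and the toy PB strings are assumptions for illustration.

```python
from collections import Counter


def equivalences(row_a, row_b, gap="-"):
    """List (i, j, pb_a, pb_b) for columns where a PB is aligned with a PB.

    i and j are ungapped positions in the first and second segment.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    i = j = 0
    pairs = []
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.append((i, j, a, b))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def shared_positions(before, after):
    """Fraction of positional equivalences in 'before' retained in 'after'."""
    pos_before = {(i, j) for i, j, _, _ in before}
    pos_after = {(i, j) for i, j, _, _ in after}
    return len(pos_before & pos_after) / len(pos_before) if pos_before else 1.0


def pb_pair_counts(pairs):
    """Count PB-PB letter pairs irrespective of where they occur."""
    return Counter((a, b) for _, _, a, b in pairs)


b_svr = equivalences("fklmmm", "klmmmn")
a_svr = equivalences("fklmmm-", "-klmmmn")
print(shared_positions(b_svr, a_svr))
print(pb_pair_counts(b_svr)[("m", "m")], pb_pair_counts(a_svr)[("m", "m")])
```

In this toy case the pair m-m is present in both alignments, yet none of the residue positions paired before re-alignment are paired afterwards, which is the situation the paragraph cautions about.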
A comparison of the difference in the percentage of gaps between bSVRs and aSVRs (Figure S5A) shows that 76.6% of the alignments have no change in the number of gaps. A decrease in the percentage of gaps has been observed for 18.4% of SVRs and an increase is seen in 5.0% of SVRs. Although a decrease in the percentage of gaps is indicative of higher similarity in terms of the lengths of the protein structures aligned, the introduction of gaps is sometimes favored as it reduces the number of equivalences of dissimilar PBs. Another interesting parameter compared was the number of gap openings in the alignments. Accommodation of insertions and deletions would require a re-adjustment in protein structures. We would expect fewer insertion and deletion events during protein evolution so as to preserve the three-dimensional structure, and thus intuitively fewer gaps interspersed in the alignments, especially in the middle of helices and strands [37,38,53]. The difference in the number of gap openings in aSVRs as compared to bSVRs is not significant (mean value equal to -0.38) (see Figure S5B). Nonetheless, some examples were observed where gaps in the stretch of aligned PBs corresponding to α-helix and β-strand are eliminated in aSVRs (see the section below).
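The two gap measures compared above can be computed directly from a pair of gapped PB strings. The following Python sketch is illustrative only: the definition of the gap percentage as the share of alignment columns containing a gap, and the function names, are assumptions rather than the exact definitions used in the study.

```python
from itertools import groupby


def percent_gaps(row_a, row_b, gap="-"):
    """Percentage of alignment columns that contain a gap in either row."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("alignment rows must be non-empty and of equal length")
    gapped = sum(1 for a, b in zip(row_a, row_b) if a == gap or b == gap)
    return 100.0 * gapped / len(row_a)


def gap_openings(row, gap="-"):
    """Number of separate gap runs in one row of the alignment."""
    return sum(1 for is_gap, _ in groupby(row, key=lambda c: c == gap) if is_gap)


# Toy rows: a helix (PB m) interrupted by gaps, and the same PBs with
# the gaps moved to the end of the segment.
print(gap_openings("mm-mm--n"), gap_openings("mmmmn---"))
print(percent_gaps("mm-m", "mmm-"))
```

Fewer gap openings inside a run of PB m, as in the second toy row, corresponds to the uninterrupted helix that is expected on structural grounds.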

Analysis of SVRs
Local structural similarity could be identified in terms of PB sequence similarity. We have also analyzed it by comparing the SDM of bSVR and aSVR alignments. Profit software [54] was used to perform the superimposition. The rmsds obtained from these superimpositions were converted into SDMs (Structural Distance Metric) [55,56] (see Materials and Methods section).
With a global rigid body protein structure superposition, regions of high deviation usually correspond to regions (i) that are spatially displaced although structurally similar or (ii) that have a genuine difference in local conformation. The Protein Blocks approach can distinguish these two scenarios. Indeed, a rigid body displacement of a local region after superposition would result in a high PB score and low SDM for the aligned regions. However, when the regions are conformationally distinct, the PB score would be low and the SDM would be high. The results of the assessment of the approach in terms of SDM and PB scores have been tabulated (see Table 1). On average, 36.2% of SVRs showed an improvement in SDM values. For 32.7% no change has been observed, while for 31.0% a decrease has been observed. In the last category of cases, superimposition is often not relevant, as the mean PB score for these SVRs is 0.02 after re-alignment (-0.49 before re-alignment).
28.9% of the SVRs in the dataset showed an improvement in both PB scores and SDM values. Improvements were due to re-alignment of segments which were displaced/oriented differently in previous alignments. Figure 4A shows an illustrative example highlighting the improved alignment of an α-helix displaced in the alignment obtained by superposition of gross structures. The figure on the left shows a superposition of SVR segments based on the alignment obtained using DALI. The superposition of the segments obtained after re-alignment is shown on the right. Below each superposition are shown the alignments, PB scores and SDM values for bSVRs and aSVRs. With appropriate placement of gaps, PBs k, l, m, n, o, p and a are aligned in the aSVR, thus identifying local structural similarity previously unknown. Similarly, a β-strand oriented differently in the homologue could be aligned with a lower SDM using the PB approach, as shown in Figure 4B. Figure 4C shows the superposition of regions corresponding to loops. Although the loop conformations are identical, as indicated by the identical PBs in the two structures, the SDM is high due to the difference in orientation (superposition on the left). An optimal superimposition and residue-residue equivalences could be obtained using the PB approach (superposition on the right). As exemplified above, the local structural similarity was previously unidentified due to rigid body shifts. In other cases, improvements were observed in the alignment of segments with differing lengths but with local structural similarity. The region of similarity was found to lie either at the ends of the alignments or in the middle of the alignment flanked by gaps. Figure 4D shows an example of a region similar at one end. Moreover, the continuity of the helix in the new alignment is evident, bringing the PBs f, k, l and the series of m in the two sequences into register. As mentioned in the previous section, insertions and deletions in the middle of a helix or a strand are tolerated to a lesser extent than in the rest of the structure. Through our approach, gaps in the middle of a helix or a strand have been reduced/eliminated. Figure 4E shows an example of a region of local similarity in the middle of the alignment. The PBs a, c and d correspond to a small strand with a transition to a coil-like region denoted by PBs k and l in the variable segment of CD4 glycoprotein (PDB code: 1cid, chain A; shown in red; [57]). This region is aligned with the C terminal end of the β-strand in the homologue (in blue). The example highlights an improvement in the capping region of a β-strand transiting to coil. The improvement in scores could also be attributed to a decrease in the equivalences of PB m (i.e., helical state) with PBs associated with strands and capping regions (i.e., PBs d, b, c and f), as illustrated in Figure 4F. An alignment of a helix and a strand is meaningless in a structural context as these regions, though equivalent in homologous proteins, do not share structural similarity. Hence the alignment of PBs corresponding to helices and strands would be insignificant.
32.7% of the alignments showed no difference in PB scores and SDM. The plausible reasons are that the existing equivalences in the SVRs were already optimal and could not be improved further using the PB approach, and/or that the aligned regions are conformationally different. Equivalences were preserved in the majority of aSVRs. One quarter of these alignments correspond to conformationally different segments (according to the cutoff determined previously, see the first section in Results and Discussion). An example where the scores for aSVRs fall below the cutoff and the sequences aligned are conformationally different is presented in one of the subsequent sections.
21.8% of the aSVRs have better PB scores but slightly higher SDM values (by 3.7 Å on average) than the corresponding bSVRs. 25.5% of these SVRs correspond to conformationally dissimilar regions based on the cutoff previously determined; hence such regions cannot be superimposed well. In general, changes in PB-PB equivalences were observed due to re-distribution of gaps, which improved the scores; however, this increase is not reflected in the SDM values. Short stretches of local similarity in segments of overall different conformations led to an increase in the PB scores but with a slight increase in SDM values, due to poor similarity in the remaining segment, presenting complex cases of superposition. This is explained through an example illustrated in Figure 4G. The PBs k, l and m are aligned in the aSVR, hence improving the score although the remaining segment shares low similarity. A similar observation can be made from the example in Figure 4H. The PBs a, c, f and k align in the aSVR and improve the score. Therefore, where the conformations of the superposed segments are very different and the similar region is very short, the overall PB score may improve but the SDM values may increase slightly.
In contrast to the above scenario, 5.6% of the aSVRs have lower PB scores but improved SDM values. 42.02% of the segments aligned in these cases are conformationally distinct. In the remaining cases, a redistribution of gaps led to different equivalences. Here, the mean difference in SDM is -16.77 for conformationally similar segments (SAP > -0.42). Small regions of similarity are preserved while the rest of the alignment undergoes a change in equivalences. In certain cases, this results in an improvement of the overall superposition but a decrease in PB scores. An example is presented in Figure S6A. The region of alignment of identical PBs is small (PBs k and l at the C terminal end). The rearrangement of PB equivalences in the remaining region decreases the score.
For 14 cases (0.023%) of alignments, no differences in SDM values were observed but a difference in scores for aligned PBs was found. For 12 out of 14 cases the PB score improved and for 2 cases the PB score did not improve. 21.43% of these segments exhibit conformational dissimilarity. The mean difference in scores for the remaining segments is 0.52. The change in scores indicates a change in PB-PB equivalences. In two-thirds of the cases, the number of equivalences before and after re-alignment remains the same without a change in overall atomic superposition. In the remaining one-third, the number of equivalent PBs (or % gaps in the alignment) has changed without changing the SDM values.
More surprisingly, for a limited number of cases, i.e., 3.2% of SVRs, no difference in PB scores was observed, but a change in SDM was seen. It is a consequence of new equivalences without a change in the nature of the PBs aligned, which improved the SDM in 1.7% of these SVRs but did not improve it in the remaining 1.5% of SVRs. Finally, 7.7% of the aSVRs showed both a decrease in PB scores and a higher deviation at Cα positions. This is mainly due to repeats of PBs, which lead to the possibility of alternate alignments, analogous to the alignment of low complexity regions in amino acid sequences. The local similarity is observed at the ends of the alignment. PBs at the ends come close while eliminating the gap, which results in a poor SDM and a poor score (see Figure S6B). Structural alphabets m and f are repeated. As a result, many alternate alignments are possible. The PBs c, f and k in the segment of the protein cytochrome P450 (PDB code: 1io7, chain A; [58]) could align to PBs c, f, b or d, f and k.
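The categories discussed in this section follow from the direction of change of two quantities per SVR, the PB score (higher is better) and the SDM (lower is better). A minimal Python sketch of such a tally is given below; the function names, the tolerance used to call two values equal and the toy records are assumptions for illustration.

```python
from collections import Counter


def direction(delta, tol=1e-9):
    """Return +1, 0 or -1 for a positive, negligible or negative change."""
    return (delta > tol) - (delta < -tol)


def categorize(pb_before, pb_after, sdm_before, sdm_after):
    """Label an SVR by how its PB score and SDM changed on re-alignment."""
    labels = {1: "improved", 0: "unchanged", -1: "worse"}
    pb = direction(pb_after - pb_before)      # a higher PB score is better
    sdm = direction(sdm_before - sdm_after)   # a lower SDM is better
    return f"PB {labels[pb]} / SDM {labels[sdm]}"


# Toy records (pb_before, pb_after, sdm_before, sdm_after), not real data.
records = [
    (-0.8, 1.2, 60.0, 35.0),
    (-0.5, 0.1, 40.0, 44.0),
    (0.3, 0.3, 50.0, 50.0),
    (0.9, 0.2, 45.0, 30.0),
]
tally = Counter(categorize(*r) for r in records)
for label, count in tally.items():
    print(label, count)
```

Dividing each count by the number of SVRs would give percentages of the kind reported above for each combination of PB score and SDM change.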
The application of the approach in modeling loop regions and in analyzing structure-function relationships has been discussed in the next two sections.

Loop modeling
One of the most challenging tasks in comparative modeling [59,60] is to obtain an accurate model of protein loops, as they often hold the functional site [61,62]. Errors in modeling loops are high as they are structurally variable regions and may not be conserved even among closely related proteins [63,64]. Modeling loop regions is difficult as the conformations also depend on the length of the loop and on certain key residues [65]. If the sequence similarity among the homologues is low or the regions are variable in length, the problem is compounded. Additionally, the number of geometrically possible loop conformations increases exponentially with loop length. Consequently, it becomes a daunting task to obtain an accurate model of loop regions. The conformation of a loop can be predicted by identifying a loop template from a homologous structure or by searches in databases of loop conformations of various lengths obtained from known three-dimensional structures [59,66,67,68]. It has been shown previously that the modeling of loops is more accurate if a homologue is used as one of the templates [69]. However, finding a homologue as a template for loop modeling is not always possible, and in most cases a template is obtained from a database search. The alternate approach, ab initio modeling of the loop region, is based on a potential or scoring function and works best for short segments [59,70,71,72].
Given that loop modeling is non-trivial and is most accurate when the equivalent regions are obtained from homologues, we have explored the use of information on local conformation, through representation of templates as Protein Blocks, in obtaining clues for comparative modeling. This has been exemplified through a modeling exercise on a segment of the Alpha-L-arabinofuranosidase protein (from Bacillus stearothermophilus, PDB code 1qw9 [73]) using the Beta-D-xylosidase structure (PDB code 1w91 [74]) as the template, which shares an overall sequence identity of 7.7% with the target. A number of models (100 each) were generated using Modeller 9v7 with the classical approach [60], based on alignments from bSVRs and aSVRs of the target and template sequences. Figure 5A shows the variation in rmsd values for various models with respect to the template structure. Rmsd values are lower when models are generated based on the equivalences from aSVRs (red, Figure 5A) as compared to bSVRs (blue, Figure 5A). This indicates an improvement in models of the target segments when new equivalences based on the PB approach were used. The two alignments, bSVR and aSVR, along with the corresponding PBs are shown in Figure 5B. For further analysis, the best models, having the lowest rmsd with respect to the template structure, were selected from each set of 100 models (model 8: lower plot for aSVRs; and model 74: upper plot for bSVRs). Figure 5C shows the superposition of the modeled segments with the known crystal structure (green) based on the alignment from bSVRs (blue) and aSVRs (red), respectively. The model generated based on the equivalences from the PB approach produces a lower rmsd (1.98 Å) when superposed on the crystal structure as compared to the model generated using the original approach (rmsd: 3.50 Å).

Understanding sequence-structure relationships
Sequences of homologous proteins may evolve and diverge beyond recognition by simple homology searches. Usually, the extent of difference exhibited by sequences is higher than that exhibited by structures. In this section we show how the PB-based alignment of SVRs can be used to further the understanding of sequence-structure relationships.
Here we present two examples where the local structures are very different in pairs of homologous protein structures. Figure 6A shows the superposition of the 3-D structures of two homologous elongation factors, 1d2e [75] (from cow) and 2c78 [76] (from Thermus thermophilus), belonging to SCOP family c.37.1.8. The regions encircled in Figure 6A are identical in terms of amino acid sequence but adopt very different structures. The PB score after optimal PB-based alignment of SVRs (aSVR) is -2.07. Figure 6B shows an example of homologous protein structures (PDB codes 1nkr [77] and 1cvs [78]; SCOP family b.1.1.4, I-set domains) with poor sequence and structural similarity in a local region. Although the rest of the structures superimpose well, the regions encircled in Figure 6B have very different local structures. The PB score after the optimal PB-based alignment of SVRs is -2.07. As illustrated by the two examples above, the regions are conformationally different.
In the example of elongation factors shown in Figure 6A, one might expect almost identical structures for the local regions with identical sequences in two closely related proteins. However, the PB-based alignment of SVRs shows that this is not a spatial difference between conformationally similar SVRs. Indeed, the low PB score indicates very different conformations of identical amino acid sequence regions. In fact, the extent of conformational difference between the SVRs of these homologues is comparable to that shown for another pair of homologous proteins in Figure 6B, where the amino acid sequences in the SVRs are very different [79]. Thus PB-based alignment of local regions (SVRs) is very helpful in cautioning us about unexpected structural differences even among "equivalent" SVRs of homologous proteins with highly similar or even identical amino acid sequences. Further, the example of the elongation factors suggests that prediction of secondary structures based on sequence composition and sequence similarity to a 'homologue' should be exercised with caution. Such conformational differences are often possible in the functional regions of homologous proteins when the homologues are crystallized in different functional forms, such as the active and inactive forms of enzymes.

Conclusions
In the current work, we have presented a refined view of the regions of homologous protein structures that exhibit apparent high deviation on global structural superposition. When the deviation is high, the equivalences assigned through atomic superimposition are inaccurate. Through representation of protein structures as PB sequence, conformational similarity could be identified for 159,780 (74%) variable segments, based on PB scores, an increase by 21%, compared to a classical structural alignment approach in the database of structurally aligned homologous protein structures. The improvement was also reflected in the lower SDM in 3D superposition based on new equivalences after re-alignment of SVRs. The equivalences could be refined for the capping regions of helices and strands and loops. Regions of high similarity could be located in homologous pairs of protein structures even when the aligned regions were of different lengths. Also segments which were spatially displaced could be identified and aligned efficiently. All these cases have been explained through appropriate examples. For the cases where the approach does not perform as well, the best (most optimal) alignment can be chosen based on the global context in the protein structures following the principles governing protein structure; for example, regions flanking the variable segment could be considered. The best alignment could be the one with continuous helix or strand uninterrupted by gaps in the alignment. The approach can be used in identifying equivalent regions in homologous structures that do not share structural similarity and in the understanding of sequence-structure relationship. It can aid in providing clues to model loops for which homologue of similar length is unavailable. The approach can be extended to understand the effect of amino acid substitutions on the local structural alterations in the homologous protein structures. 
As the approach is quite general, it can be used in conjunction with any structural alignment algorithm.
Because structural alignments are central to the understanding of protein structure-function and evolutionary relationships, an improvement in their quality has manifold applications. The approach can be extended to refine regions of high deviation obtained from simultaneous superposition of multiple protein structures. The method can be improved by using gap penalties specific to PB types with respect to the major secondary structures. In the near future, we propose to develop a web server based on our refinement approach. This comprehensive data set on homologous structures would serve as a valuable resource to study the extent and nature of alterations and structural rearrangements in the backbone conformation of homologous structures as a consequence of substitutions (conservative as well as non-conservative) and indels during the course of evolution.

Protein Data set
The protein data set was obtained from the PALI [36] (Phylogeny and Alignment of homologous protein structures) v2.7 database, which contains structure-based sequence alignments for protein domain families defined by the SCOP database (v 1.73). The data set of 74,705 pairwise alignments, generated with the DALI [14] software followed by rigid body superimposition, corresponds to 1,664 protein domain families. The structural alignments were analyzed to identify topologically equivalent and non-equivalent residues. A stretch of three or more contiguous residues with a Cα-Cα deviation lower than 3.0 Å at every position is considered a topologically equivalent segment or Structurally Conserved Region (SCR). The other regions are considered Structurally Variable Regions (SVRs). This rule is classically used in PALI. Based on this criterion, 542,610 SCRs and 347,062 SVRs were identified. These SVRs correspond to 49% of the alignment positions in the data set. Of these, 215,920 complete SVRs with more than three aligned PBs have been considered for further analyses. Our entire analysis is confined to the alignment of SVRs.
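The SCR/SVR rule above can be illustrated with a minimal Python sketch. It assumes that the per-position Cα-Cα deviations of a superimposed pairwise alignment are already available as a list, with `None` marking positions where one structure has a gap; the function name `split_scr_svr` and this input layout are illustrative assumptions and are not taken from PALI.

```python
from typing import List, Optional, Tuple

def split_scr_svr(
    deviations: List[Optional[float]],
    cutoff: float = 3.0,   # Å, threshold stated in the text
    min_len: int = 3,      # minimum run length for an SCR
) -> List[Tuple[str, int, int]]:
    """Label alignment positions as SCR or SVR segments.

    Returns (label, start, end) tuples with `end` exclusive.
    A gap (None) can never belong to an SCR.
    """
    n = len(deviations)
    conserved = [False] * n
    i = 0
    while i < n:
        if deviations[i] is not None and deviations[i] < cutoff:
            # Extend the run of contiguous low-deviation positions.
            j = i
            while j < n and deviations[j] is not None and deviations[j] < cutoff:
                j += 1
            if j - i >= min_len:
                for k in range(i, j):
                    conserved[k] = True
            i = j
        else:
            i += 1

    # Collapse the per-position flags into labelled segments.
    segments = []
    start = 0
    for k in range(1, n + 1):
        if k == n or conserved[k] != conserved[start]:
            segments.append(("SCR" if conserved[start] else "SVR", start, k))
            start = k
    return segments

# Toy deviations (hypothetical values, in Å); the run of two residues at
# positions 5-6 is too short to form an SCR and is absorbed into the SVR.
example = [1.0, 1.2, 0.8, 5.0, None, 2.0, 2.5, 4.0, 1.0, 1.1, 1.3, 1.4]
segments = split_scr_svr(example)
# segments == [("SCR", 0, 3), ("SVR", 3, 8), ("SCR", 8, 12)]
```

Short low-deviation runs are deliberately merged into the surrounding SVR here, which is one reading of the "three or more contiguous residues" criterion.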

Protein Blocks
Protein Blocks (PBs) correspond to a set of 16 local prototypes, labeled from a to p (see Figure 1 of ref [43]), each 5 residues in length and based on a (φ, ψ) dihedral angle description. They were obtained by an unsupervised classifier similar to Kohonen maps [80] and hidden Markov models [81]. The PBs m and d can be roughly described as prototypes for the central α-helix and the central β-strand, respectively. PBs a through c primarily represent β-strand N-caps and PBs e and f, C-caps; PBs g through j are specific to coils, PBs k and l to α-helix N-caps, and PBs n through p to C-caps. This structural alphabet allows a reasonable approximation of local protein 3D structures, with a root mean square deviation (rmsd) now evaluated at 0.42 Å [42,43]. PBs have been assigned using in-house software that follows rules similar to the assignment done by the PBE web server (http://bioinformatics.univ-reunion.fr/PBE/) [41].
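As an illustration of how a five-residue window can be mapped to a prototype, the following Python sketch assigns each window to the nearest prototype in dihedral-angle space. It assumes the criterion commonly used for PB assignment, the root mean square deviation on angular values (RMSDA) over the eight dihedral angles of the window; the in-house software used here is not described in that detail, so this is an assumption. The two prototypes `H` and `E` are hypothetical toy values, not the published 16 PB prototypes, and all function names are illustrative.

```python
import math
from typing import Dict, List

def angle_diff(a: float, b: float) -> float:
    """Smallest signed difference between two angles in degrees."""
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d

def rmsda(v1: List[float], v2: List[float]) -> float:
    """Root mean square deviation on angular values."""
    return math.sqrt(sum(angle_diff(a, b) ** 2 for a, b in zip(v1, v2)) / len(v1))

def window_angles(phi: List[float], psi: List[float], i: int) -> List[float]:
    """Eight dihedrals of the 5-residue window centred on residue i."""
    return [psi[i - 2], phi[i - 1], psi[i - 1], phi[i],
            psi[i], phi[i + 1], psi[i + 1], phi[i + 2]]

def assign_blocks(phi: List[float], psi: List[float],
                  prototypes: Dict[str, List[float]]) -> str:
    """Assign the closest prototype to every residue that has a full window."""
    labels = []
    for i in range(2, len(phi) - 2):
        v = window_angles(phi, psi, i)
        labels.append(min(prototypes, key=lambda name: rmsda(v, prototypes[name])))
    return "".join(labels)

# Hypothetical toy prototypes (NOT the real PB definitions):
# an idealised helix-like and an idealised strand-like window.
TOY_PROTOTYPES = {
    "H": [-45.0, -60.0, -45.0, -60.0, -45.0, -60.0, -45.0, -60.0],
    "E": [130.0, -120.0, 130.0, -120.0, 130.0, -120.0, 130.0, -120.0],
}

helix_like = assign_blocks([-62.0] * 7, [-41.0] * 7, TOY_PROTOTYPES)
# helix_like == "HHH"  (7 residues give 3 complete windows)
```

With the real alphabet, the `prototypes` dictionary would hold the 16 published angle vectors keyed by the letters a to p.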

Re-alignment of structurally variable region
To re-align SVRs with the aim of improving the alignments, we have adapted our previous approach [41,45]. We had proposed a PB substitution matrix (PB SM) similar to the matrices used for sequence alignment. A novel refined version of the PB SM, optimized for mining databases and improving alignment quality, has been generated (Joseph et al., submitted). In this work, we have used the refined PB SM coupled with the classical CLUSTALW approach [52] to realign protein structures. The parameters used in CLUSTALW were tuned to make it specific for PBs instead of amino acid residues. All residue-specific and position-specific gap penalties were turned off. A range of gap penalty values was evaluated systematically for generating alignments. Finally, a gap opening penalty of 10 and a uniform gap extension penalty of 0.2 were chosen based on the alignment scores. It must be noted that the PB substitution matrix values were scaled between 0 and 10 to make the matrix compatible with the alignment software. Newly aligned SVRs are named aSVRs, while the previous alignments are named bSVRs.
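The rescaling of the substitution matrix to the 0 to 10 range can be sketched as follows. The text does not state how the scaling was done; the sketch assumes a simple linear (min-max) rescaling, and the matrix values shown are hypothetical placeholders, not entries of the actual PB SM.

```python
from typing import Dict

Matrix = Dict[str, Dict[str, float]]

def scale_matrix(matrix: Matrix, low: float = 0.0, high: float = 10.0) -> Matrix:
    """Linearly rescale all matrix entries into [low, high] (assumed scheme)."""
    values = [v for row in matrix.values() for v in row.values()]
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("matrix entries are all identical; cannot rescale")
    span = v_max - v_min
    return {
        a: {b: low + (high - low) * (v - v_min) / span for b, v in row.items()}
        for a, row in matrix.items()
    }

# Hypothetical 2x2 excerpt; real PB SM values are not reproduced here.
toy_sm = {
    "m": {"m": 2.0, "d": -4.0},
    "d": {"m": -4.0, "d": 3.0},
}
scaled = scale_matrix(toy_sm)
# The lowest entry (-4.0) maps to 0 and the highest (3.0) maps to 10.
```

A linear map preserves the ranking of substitutions, so the relative preference between PB pairs is unchanged; only the numeric range seen by the alignment program differs.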

Calculation of alignment scores
To evaluate the quality of the new alignments of SVRs relative to the previous alignments, scores were computed for both. Two scores were calculated for each alignment, one excluding and one including the gaps in the alignment. Together, these scores reflect the differences between two alignments of a pair of segments in terms of both the PB substitutions at the aligned positions and the lengths of the alignments. The aligned PB positions were scored using the values from the PB SM. The sum of these values was normalized by the number of PB pairs to give the Score for Aligned Pairs (SAP) of an alignment. To calculate the Score for Complete Alignment (SCA), which includes gaps, every alignment position with a gap was given a score of -3, and the total was normalized by the length of the alignment.
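The two scores described above can be written out as a short Python sketch. The gap score of -3 and the two normalizations follow the text; the function name, the use of `-` as the gap character, and the toy substitution values are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

GAP = "-"  # assumed gap character in the aligned PB strings

def pair_score(sm: Dict[Tuple[str, str], float], a: str, b: str) -> float:
    """Look up a substitution score, treating the matrix as symmetric."""
    if (a, b) in sm:
        return sm[(a, b)]
    return sm[(b, a)]

def alignment_scores(
    aln_a: str,
    aln_b: str,
    sm: Dict[Tuple[str, str], float],
    gap_score: float = -3.0,   # per-gap-position score stated in the text
) -> Tuple[Optional[float], float]:
    """Return (SAP, SCA) for two aligned PB strings of equal length."""
    if len(aln_a) != len(aln_b) or not aln_a:
        raise ValueError("aligned strings must be non-empty and of equal length")
    pair_sum = 0.0
    n_pairs = 0
    n_gaps = 0
    for a, b in zip(aln_a, aln_b):
        if a == GAP or b == GAP:
            n_gaps += 1          # position contributes only to SCA
        else:
            pair_sum += pair_score(sm, a, b)
            n_pairs += 1
    sap = pair_sum / n_pairs if n_pairs else None   # normalized by PB pairs
    sca = (pair_sum + gap_score * n_gaps) / len(aln_a)  # normalized by length
    return sap, sca

# Hypothetical substitution values, for illustration only.
toy_sm = {("m", "m"): 2.0, ("m", "n"): -1.0, ("n", "n"): 2.0, ("d", "d"): 2.0}
sap, sca = alignment_scores("mmm-d", "mmnnd", toy_sm)
# sap == 1.25 (5.0 over 4 pairs); sca == 0.4 ((5.0 - 3.0) over 5 positions)
```

Comparing SAP between a bSVR and its aSVR shows whether the aligned PB pairs became more similar, while SCA additionally penalizes an alignment that gains similarity only by introducing many gaps.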

Calculation of SDM values
To assess the improvement of alignments using our approach, the SDM values of SVRs before and after re-alignment were compared. The PROFIT software [54] was used to calculate rmsd values. The SVRs corresponding to the N and C termini were removed from the analysis. A total of 67,234 SVRs were considered for this analysis. Rmsds for the remaining SVRs were converted into the structural distance metric (SDM) as proposed by Blundell and coworkers [55,56]. The SDM calculation was modified to suit the data presented here: in the RMS term, instead of dividing the rmsd by 3.0, we divide it by the highest rmsd (24.97) among all the alignments of SVRs, so that the RMS values fall in the range of 0 to 1.

Supporting Information
Figure S1 The distribution of scores for bSVRs (A and C) and aSVRs (B and D).
Text S1 Comparison of the distribution of SAP and SCA scores in bSVRs and aSVRs. (DOC)
Text S2 Correlation of SAP and SCA scores in bSVRs and aSVRs. (DOC)