CAB-Align: A Flexible Protein Structure Alignment Method Based on the Residue-Residue Contact Area

Proteins are flexible, and this flexibility has an essential functional role. Flexibility can be observed in loop regions, rearrangements between secondary structure elements, and conformational changes between entire domains. However, most protein structure alignment methods treat protein structures as rigid bodies. Thus, these methods fail to identify the equivalences of residue pairs in regions with flexibility. In this study, we considered that the evolutionary relationship between proteins corresponds directly to the residue–residue physical contacts rather than the three-dimensional (3D) coordinates of proteins. Thus, we developed a new protein structure alignment method, contact area-based alignment (CAB-align), which uses the residue–residue contact area to identify regions of similarity. The main purpose of CAB-align is to identify homologous relationships at the residue level between related protein structures. The CAB-align procedure comprises two main steps: First, a rigid-body alignment method based on local and global 3D structure superposition is employed to generate a sufficient number of initial alignments. Then, iterative dynamic programming is executed to find the optimal alignment. We evaluated the performance and advantages of CAB-align based on four main points: (1) agreement with the gold standard alignment, (2) alignment quality based on an evolutionary relationship without 3D coordinate superposition, (3) consistency of the multiple alignments, and (4) classification agreement with the gold standard classification. Comparisons of CAB-align with other state-of-the-art protein structure alignment methods (TM-align, FATCAT, and DaliLite) using our benchmark dataset showed that CAB-align performed robustly in obtaining high-quality alignments and generating consistent multiple alignments with high coverage and accuracy rates, and it performed extremely well when discriminating between homologous and nonhomologous pairs of proteins in both single and multi-domain comparisons. The CAB-align software is freely available to academic users as stand-alone software at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html.


Introduction
During evolution, protein structures are more highly conserved than nucleotide sequences [1,2]. Therefore, comparing protein structures is a fundamental approach for homology detection, classification, and functional annotation for novel protein structures [3][4][5]. In particular, structural alignment, which assigns amino acids that are equivalent among proteins, is very useful for detecting functional sites and conserved positions. Many protein structure alignment methods have been developed in the past two decades. Most of these alignment methods aim to find the best alignment yielding the maximum number of equivalent amino acids between proteins with minimal structural deviation following three-dimensional (3D) coordinate superposition. These methods treat protein structures as rigid bodies and are categorized as "rigid-body alignment" approaches; these include DALI, FAST, CE, MAMMOTH, TM-align, and Fr-TM-align [6][7][8][9][10][11]. The superposition of protein structures in Cartesian space, such as RMS fitting, is sufficiently fast and accurate [12]. Thus, rigid-body alignment is widely used to find remote homologs when only proteins with low shared sequence identities are available.
However, proteins are known to be flexible, and this flexibility has an essential role in their functions, such as catalysis, protein-ligand interactions, and protein-protein interactions [13][14]. Rigid-body alignment methods cannot find correct alignments for proteins that undergo structural changes (almost all proteins exhibit small/large movements), and thus they fail to identify structural similarities in flexible regions. To overcome these issues, a flexible protein structure alignment approach has been developed called FATCAT [15], which finds the optimal structure alignment with the least number of rigid body movements using a dynamic programming (DP) algorithm to connect aligned fragment pairs (AFPs). In the DP algorithm, FATCAT uses the score calculated from the RMSD of AFPs and the number of twists. Therefore, FATCAT depends on the 3D coordinate superposition of AFPs. In addition, a flexible and translation/rotation-invariant alignment method has been proposed that does not use 3D coordinate superposition, which is called maximum contact map overlap (CMO) [16]. In CMO, the protein structure is represented by residue-residue contact maps, which are defined by the Euclidean distance between the representative coordinates in the corresponding amino acids. The CMO algorithm has been studied widely. In particular, Andonov et al. [17] proposed an exact CMO algorithm using integer programming and Lagrangian relaxation. Wohlers et al. [18] proposed an approximated CMO score and reported its protein structure classification performance. Moreover, GR-Align [19] is a fast CMO heuristic method based on generalized graphlets and the graphlet degree-to-order graph. GR-Align is at least 79 times faster than TMalign according to the reported results.
Recently, it was shown that residue-residue contacts have strong relationships with correlated mutations in multiple sequence alignments [20]. It was also observed that residue-residue contacts guide protein folding, and they are highly informative for fold recognition [21][22][23][24]. Therefore, we considered that the evolutionary relationships between proteins should reflect residue-residue physical contacts directly, rather than 3D coordinates. Thus, the assembled residue-residue physical contacts should reflect a protein fold represented in terms of 3D coordinates.
Based on the same concept, Olechnovic et al. introduced a contact area difference (CAD)based score for evaluating the structural similarity between a protein model and the native structure [25], where they proposed the use of the residue-residue contact area as a residueresidue physical contact. Thus, the CAD-score is an extended algorithm based on CMO methods. They showed that the CAD-score is a more robust evaluation score than the global distance test total score [26], which is based on 3D coordinate superposition. The CAD-score was also shown to be robust when assessing the accuracy of protein models for multidomain and protein-protein complexes, as well as single-domain proteins. The CAD-score is essentially unaffected by the domain arrangement, and it can be applied to the flexible alignment method.
In this study, we developed a new protein structure alignment method called contact areabased alignment (CAB-align), which uses the similarity of the residue-residue contact area. The main aim of CAB-align is to identify homologous relationships at the residue level between related protein structures. CAB-align comprises the following three main steps. First, CABalign employs a rigid-body alignment method based on local structural similarity. Second, structural alignment is performed based on global 3D structure superposition to generate a sufficient number of initial alignments. Finally, a heuristic method (iterative DP) is executed based on the modified CAD-score to obtain the optimal alignment.
In various benchmarks for protein structure alignment, the alignment quality (AQ) is defined based on comparisons with manually curated gold standard alignments or using geometrical similarity measures. In general, gold standard alignments are used, such as in SISY-PHUS [27], the Conserved Domain Database [28], and the Homologous Structure Alignment Database [29]. In these databases, experts consider both the geometric superposition and sequence-based alignment, as well as manually resolving any conflicts between them. There are also many types of geometric similarity measures, such as the similarity index, structural alignment score, match index [30], and template modeling score [31]. These measures achieve a balance between the alignment coverage and geometrical deviation after optimal superposition. Therefore, the superposition of 3D coordinates remains an essential component of structure alignment methods and AQ evaluations.
However, as mentioned above, our CAB-align method allows flexible protein structure alignments where proteins are flexible. Therefore, we could not use geometric similarities that employ superposition based on Cartesian coordinates to assess the performance of this alignment method, we employed four main evaluation approaches: (1) agreement with the gold standard alignment, (2) AQ based on an evolutionary relationship without the superposition of 3D coordinates, (3) consistency among multiple alignments, and (4) agreement with the gold standard classification.
We compared the performance of CAB-align with the HHalign sequence alignment method and three representative structure alignment methods that each use different algorithms: rigidbody alignment (TM-align), flexible alignment (FATCAT), and the residue-residue distance matrix-based method (DaliLite_v3.3 [32], a standalone version of the DALI server).

Results
We compared CAB-align with three state-of-the-art structure alignment methods, i.e., TMalign, FATCAT, DaliLite_v3.3 (denoted as DaliLite in this study), and the HHalign sequence alignment method (except for the quality of alignment), using three evaluation approaches: quality of alignment, alignment consistency, and agreement with the SCOPe classification [33].

Training and Benchmark Datasets
We used four datasets for training and evaluation: SISYPHUS_ID10, SCOPe_FAMILY, SCO-Pe_NR10, and PDB30. SISYPHUS_ID10 is a subset of the SISYPHUS dataset. SISYPHUS contains manually created structural alignments for protein pairs with nontrivial structural relationships. SISYPHUS_ID10 contains 1,627 alignments with sequence identities of <10%. SCOPe_FAMILY (2,623 domains in 903 superfamilies and 591 folds) is a subset that represents each FAMILY in the SCOPe database (Structural Classification of Proteins-Extended; Release 2.03, Oct. 2013). SCOPe_NR10 (3,542 domains in 2,160 families, 897 superfamilies, and 587 folds) is a subset with <10% shared identity from SCOPe. The lists for SCOPe_NR10 and SCOPe_FAMILY were taken from the SCOPe website (http://scop.berkeley.edu/). We then excluded low-quality structures (SPACI score [34] <0.4), NMR structures, and irregular structures, which had an atom assigned to multiple coordinates from SCOPe_FAMILY and SCO-Pe_NR10. In SCOPe, each protein structure is split into a domain unit and then classified according to structural and evolutionary relationships. Therefore, the impact of significant protein flexibility, such as domain arrangements, is limited in these two datasets. PDB30 is a subset of the protein data bank (PDB [35], 14 Mar 2014) with <30% shared identity. This dataset includes single-domain proteins, multidomain proteins, and all types of protein flexibility.

Evaluation Criteria
AQ. First, we evaluated the alignments using the gold standard alignments in SISYPHU-S_ID10 as the reference. We defined two criteria of agreement and reliability as follows: where N c is the number of correctly aligned positions, L ref is the length of the reference alignment, and L ali is the length of the alignment under evaluation. Positions with gaps were excluded. Second, we defined correctly aligned positions based on comparisons with the reference alignment, which reflected the evolutionary relationship. To generate the reference alignments, we used the HMM-HMM alignment program (HHalign in HH-suite 2.0 [36]), which compares two hidden Markov models (HMMs). The HMM profiles were generated using hhblits (included in HHsuit 2.0) against the Uniprot20 database (included in HHsuit 2.0). In this study, the alignments obtained from HHalign were not considered to be perfect, but they provided reliable confidence estimates for each aligned position. According to the description of HHblits, the confidence values obtained for each position by HHalign are highly correlated with the accuracy of the aligned positions. The confidence values are calculated by comparing the sequence profiles of the aligned positions. The E-value estimated by HHalign shares high similarity with the observed E-value. To assess the quality of the alignment, we defined AQ as the fraction of correctly aligned positions with estimated confidence values not less than a given threshold, as follows: where CR conf!x and N conf!x are the number of correctly aligned positions and the number of positions with confidence values not less than the given value of x in the reference alignment, respectively. Consistency of multiple alignments. According to Sadowski and Taylor [37], we assessed the consistency among triplets of alignments, such as three alignments between A-B, B-C, and C-A for proteins A, B, and C. In a consistently aligned position, the following condition: is true, where E(A i , B j ) denotes that position i in protein A is aligned with position j in protein B. To evaluate the consistency, we used the coverage (Cov) and relative rate (Rate) of the consistent position, which are defined as follows: where N consist is the number of consistent positions among the triplet alignment, N A is the number of residues in protein A, and AL A,B,C denotes the number of positions that are commonly aligned regions among the three alignments. Positions with gaps in any of the three alignments were excluded. Agreement with SCOPe classification. We assessed the ability to discriminate between homologous and nonhomologous pairs of proteins based on the agreement with SCOPe classifications. We used the receiver-operating characteristic (ROC) curve and precision-recall curve (PRC) analysis. The ROC curve plots the recall against the false-positive rate (FPrate). The PRC plots the precision (or reliability) against the recall (or coverage). The precision, recall, and FPrate for a given threshold s are defined as follows: where TP s is the number of true positive pairs that are correctly classified as belonging to the same class (e.g., the same family, superfamily, or fold in SCOPe) based on the threshold s, FP s is the number of false-positive pairs that are incorrectly classified as belonging to the same class based on the threshold s, TN s is the number of true negative pairs that are correctly assigned as belonging to a different class based on the threshold s, and FN s is the number of false-negative pairs that are incorrectly classified as belonging to a different class based on the threshold s. We used the area under the ROC curve (AUC) and the area under the PRC (AUPRC) as measures of agreement with the SCOPe classification. The AUC corresponds to the probability that the proposed structural similarity score will rank a randomly selected domain pair that belongs to the same class as higher than a randomly selected pair that belongs to a different class. The AUPRC corresponds to the average precision of the proposed structural similarity score.

Comparison of AQ
SISYPHUS_ID10 benchmark dataset. Table 1 shows the average agreement and reliability results for SISYPHUS_ID10 (1627 pairs) using five alignment methods. We also calculated the similarity score S (Eq 11) and normalized similarity score (NormS) (Eq 16) with the CABalign scoring function. As shown in Table 1, DaliLite and CAB-align outperformed the other methods in terms of agreement and N c . In addition, these two methods generated alignments with the best and second best S and NormS on average. These results suggest that the highquality alignments tended to have a high similarity score S. Fig 1 compares the performance of CAB-align and DaliLite based on SISYPHUS_ID10. DaliLite performed slightly better than CAB-align in terms of the average agreement and reliability, but when we considered the total number of alignments with better performance, CABalign (agreement: 557, reliability: 437) performed better than DaliLite (agreement: 434, reliability: 352), where both alignment methods generated high-quality alignments (agreement > 0.5, reliability > 0.5, Fig 1A and 1B). Interestingly, CAB-align and DaliLite obtained relatively similar NormS averages (Table 1), but Fig 1C shows that CAB-align performed better than DaliLite in terms of NormS for almost all pairs. When we focused on reliability, although the N c and agreement scores for HHalign were much lower than those for the other alignment methods, the reliability value was higher for HHalign than the other methods. This result suggests that the positions aligned by HHalign were relatively shorter than the other alignments generated by structural alignment methods, but the aligned regions were the most reliable.
SCOPe and PDB dataset. We used six benchmark datasets to compare AQ, i.e., SCO-Pe_NR10_all, SCOPe_NR10_e10, SCOPe_FAMILY_all, SCOPe_FAMILY_e10, PDB30_e5, and PDB30_e10. SCOPe_NR10_all and SCOPe_NR10_e10 contained 6,799 and 3,660 pairs, respectively, which were constructed from SCOPe_NR10. In the two datasets, the two proteins in each pair were chosen from the same family. Consequently, SCOPe_NR10_all and SCO-Pe_NR10_e10 corresponded to protein pairs where there was a confirmed evolutionary relationship at the family level in the SCOPe classification. SCOPe_FAMILY_all and SCOPe_FAMILY_e10 contained 15,790 and 5,730 protein pairs, respectively, which were constructed from the SCOPe_FAMILY. The two proteins in each pair were chosen from different families, but the same superfamily. S1 Table shows the distributions of the fold classes. b Agreement with the reference alignment (Eq 1).
c Reliability of the evaluated alignment (Eq 1). d Similarity score obtained from the scoring function in CAB-align (Eq 11).
e Normalized score obtained from the scoring function in CAB-align (Eq 16).
All data are average values per alignment.
doi:10.1371/journal.pone.0141440.t001 PDB30_e5 and PDB30_e10 contained 182,907 and 122,626 protein pairs, respectively, which were constructed from PDB30. PDB30_e5 and PDB30_e10 were large benchmark datasets, which were constructed by focusing on flexible alignments. In SCOPe_NR10_e10, SCOPe_FA-MILY_e10, and PDB30_e10, each pair also had a significant relationship with E-values 10 −10 , which were estimated by HHalign. Similarly, each pair in the PDB30_e5 had an E-value 10 −5 . Fig 2 shows the average values of AQ(n) (n = 1, 2, . . ., 9) for the six benchmark datasets. For all datasets (Fig 2A-2F), CAB-align and DaliLite performed better than all of the alignment methods with any confidence value obtained from HHalign. Interestingly, for five of the six benchmark datasets, CAB-align performed better than the other alignment methods in terms of AQ(0). As shown in Table 1, the alignments obtained from HHalign had the highest reliability. Thus, CAB-align is a robust structural alignment method that obtains high-quality alignments, which had high agreement with HHalign.

Comparison of the Consistency of the Triplet Alignments
To assess the consistency of the alignments, we used six datasets that comprised triplets of alignments from the aforementioned protein pair datasets. In total, 7,384, 2,173, 50,630, 14,689, 1,403,291, and 790,623 triplets of alignments were obtained from SCOPe_NR10_all, SCOPe_NR10_e10, SCOPe_FAMILY_all, SCOPe_FAMILY_e10, PDB30_e5, and PDB30_e10, respectively.
S2 Table shows the average number of consistent positions (N consist ), length of a commonly aligned position (AL), coverage (Cov), and relative rate (Rate) for the five alignment methods using the six datasets. Fig 3 shows a plot of Cov against Rate. Based on the Rate, HHalign performed the best with all six datasets. These results correlated with the high reliability of HHalign for the SISYPHUS_ID10 benchmark dataset (Table 1). By contrast, DaliLite had the highest Cov and N consist scores, and CAB-align had the second highest scores for Cov and N consist , except for SCOPe_NR10_e10.
High scores are necessary for Cov and Rate, but they are not sufficient to obtain an accurate alignment. Thus, the consistency (Cov and Rate) does not directly represent the AQ. However, a high degree of consistency, which is derived from structural similarity, is necessary for improving multiple alignments. Fig 3 shows that the Cov score with CAB-align was comparable with that of DaliLite.

Comparison of the Classification Performance
To assess the classification performance, we used two benchmark datasets obtained from SCO-Pe_NR10 and SCOPe_FAMILY, which comprised single domains, where these two benchmark datasets are denoted as NR10 and FAMILY, respectively. Each benchmark dataset contained 249,500 domain pairs (500 versus 500 domains), excluding 500 same-domain pairs. The NR10 dataset contained 120 pairs from the same family, 732 pairs from the same superfamily, and 1,774 pairs with the same fold. The FAMILY dataset contained 410 pairs from the same superfamily and 1,288 pairs with the same fold. According to the SCOPe classification, the domain pairs within the same family or superfamily correspond to an evolutionary relationship. The fold only group proteins with similar structures; therefore, some domain pairs with the same fold only share structural similarity, and no evolutionary relationship has been confirmed. Furthermore, to assess the performance with multidomain proteins, we constructed a benchmark dataset containing 99,792 protein pairs selected from PDB30. In this multidomain benchmark dataset, all of the proteins contained at least two domains assigned by SCOPe. For this multidomain benchmark dataset, we defined a pair of proteins with no common domain class (i.e., family, superfamily, and fold) as negative; otherwise, the pairs were defined as positive. The multidomain benchmark dataset contained 946 pairs from the same family, 3,100 pairs from the same superfamily, and 5,762 pairs with the same fold. We evaluated the agreement between the five alignment methods and the SCOPe classifications at three levels (family, superfamily, and fold) using ROC and PRC analyses. Table 2 shows the AUC and AUPRC results for the family, superfamily, and fold recognition tests. The ROC curve and PRC for the superfamily recognition tests are plotted in Figs 4 and 5, respectively. For the two single-domain benchmark datasets (NR10 and FAMILY in Table 2), HHalign obtained outstanding AUPRC results but relatively low AUC results for the superfamily recognition tests, which corresponds to the evolutionary relationship. These results indicate that when the profile-profile alignment has sufficient sequence similarity, the homologous domain pairs can be detected with a high average precision but lower recall (or coverage) than the structure alignment methods. Table 2 shows the results of the comparison between CAB-align and the other structural alignment methods, which indicate that CAB-align and TM-align obtained very similar scores, with the best or second best performance in terms of AUC for the NR10 and FAMILY benchmark datasets. Interestingly, CAB-align and TM-align also obtained very similar ROC curve ( Fig  4A and 4B) and PRC (Fig 5A and 5B). These results are attributable to the limitations of the structural alignment method and the use of a benchmark dataset that only contained single domains. High flexibility (such as a domain rearrangement) is rare within a single protein domain; therefore, the key advantages of CAB-align are not applicable. CAB-align only performed slightly better than TM-align in terms of the AUC in the superfamily recognition test with both NR10 and FAMILY. The PDB30 row in Table 2 shows the performance with multidomain benchmark datasets. The ROC curve and PRC are plotted in Figs 4C and 5C, respectively. DaliLite obtained the best AUC and AUPRC results in the superfamily recognition test. The performance of CAB-align was comparable with that of DaliLite. A comparison of the results obtained using the single-domain and multidomain benchmark datasets showed that CAB-align, FATCAT, and DaliLite performed better than TM-align with the multidomain benchmark dataset. These results indicate that these three alignment methods can handle potential domaindomain rearrangements because of their flexibility. In particular, the results of the single-and multidomain recognition tests indicate that CAB-align is a robust structural alignment method for predicting whether a protein pair has superfamily relationships with a high probability.
As shown in Table 2, the number of alignments returned by DaliLite was very low. To remove the bias caused by this imbalance, we constructed three additional datasets. After removing the failed alignments from NR10, FAMILY, and PDB30, we used 47,547 (NR10), 44,234 (FAMILY), and 43,219 (PDB30) alignments, which were returned by all five alignment methods. The reduced NR10 dataset contained 111 pairs from the same family, 683 pairs from the same superfamily, and 1,592 pairs with the same fold. The reduced FAMILY dataset contained 368 pairs from the same superfamily and 1,057 pairs with the same fold. The reduced PDB30 contained 888 pairs from the same family, 2,887 pairs from the same superfamily, and 5,151 pairs with the same fold. Table 3 shows the AUC and AUPRC results for the family, superfamily, and fold recognition tests. The ROC and PRC obtained from the superfamily recognition tests are plotted in Figs 6 and 7, respectively. Similar to the results in Table 2, in the superfamily recognition test, CAB-align obtained the best AUC scores for the single-domain benchmark datasets and Dali-Lite yielded outstanding results for the multidomain benchmark dataset.

Contribution of Each Step
To assess the contribution of each step in the CAB-align method, we evaluated two components used in CAB-align denoted as step 1 and step 2. As shown in Fig 8, step 1 corresponds to the alignments obtained from the 290 initial alignments in the CAB-align procedure. These 290 initial alignments were generated by structure alignment based on the local structural similarity.
Step 2 corresponds to the alignments obtained from the 580 initial alignments. These 580 initial alignments were generated by structure alignment based on the local and global structural similarities. We selected the alignments from 290 and 580 alignments using the CAB-align scoring function (Eq 11) for step 1 and step 2, respectively. Table 4 shows the AQ results based on SISYPHUS_ID10. Fig 9 shows the AQ results based on SCOPe_NR10 and SCOPe_FAMILY. Table 5 shows the classification performance with the single-domain benchmark datasets.
The results for CAB-align and step 2 are presented in Table 5 (SISYPHUS_ID10 benchmark dataset). For step 2, S and NormS were lower than the values derived from CAB-align.  However, step2 performed marginally better than CAB-align in terms of agreement and reliability. The advantage of iterative DP in terms of the AQ was not observed with the SISPHU S_ID10 benchmark dataset. In addition, Fig 9 shows that CAB-align performed better than step 2 with all of the benchmark datasets. In terms of the classification performance, Table 5 shows that CAB-align obtained the best AUC score, but step 2 had the best AUPRC score for superfamily recognition. These results suggest that (1) step 2 dramatically improves the performance from step 1, and (2) iterative DP improves the agreement with HHalign, thereby yielding highly reliable alignments. Table 6 shows the average computation time for the five alignment methods with the SISY-PHUS_ID10 benchmark dataset (1,627 pairs). We used a general Linux computing system (Intel Xeon E5506 CPU at 2.13 GHz and 12 GB memory). We found that CAB-align was about two times slower than DaliLite. As described earlier, CAB-align generates a maximum of 580 seed alignments and performs iterative DP for each alignment. Thus, the speed of CAB-align is attributable to these complex processes.   secondary structures, and calcium-binding regions assigned by UniProtKB [38]. In this case, HHalign identified a significant relationship (E-value = 1.2 × 10 −28 ) and a reliable alignment for 146 positions between the two calmodulin-like proteins. The alignment obtained by CABalign had a higher AQ score (AQ(5) = 0.80) than DaliLite (AQ(5) = 0.43). Moreover, Fig 11 shows that CAB-align could align three calcium-binding regions, whereas DaliLite only aligned one calcium-binding region. Figs 12 and 13 show examples of protein structure comparison between 2c2f_A and 1j30_A, which are included in SISYPHUS benchmark dataset. Compared with the reference alignment in SISYPHUS (Fig 13), the alignment obtained by CAB-align had a higher agreement value (0.99) than DaliLite (0.47). CAB-align could align four helix regions for this example.

Discussion
In this study, we applied residue-residue contact area information to protein structure alignment, and we developed a novel flexible protein structure alignment method called CAB-align. The main aim of CAB-align is to identify homologous relationships at the residue level between related protein structures whenever local or global conformational changes occur. CAB-align comprises two main steps: first, 580 initial alignments are generated based on local and global structural similarities, and second, a similarity score S (Eq 11) is calculated from the residueresidue contact area matrix, which is then maximized by iterative DP. To distinguish homologous pairs of proteins from nonhomologous pairs without a size dependency, the S value is normalized by considering the total area of the inter-residue contacts.
We evaluated the performance and advantages of CAB-align using a manually created gold standard benchmark dataset (SISYPHUS) and three large benchmark datasets, i.e., SCOPe_-FAMILY, SCOPe_NR10, and PDB30. Our comparison of CAB-align with other state-of-theart protein structure alignment methods (TM-align, FATCAT, and DaliLite) showed that CAB-align was robust, and it obtained high-quality alignments for protein pairs with known evolutionary relationships. Moreover, CAB-align generated consistent multiple alignments with high coverage and accuracy rates, which were comparable with those obtained by DaliLite. Finally, CAB-align performed well at discriminating between homologous and nonhomologous pairs of proteins in both single-and multidomain comparisons.  These results suggest future applications for the CAB-align algorithm. For example, the production of high-quality alignments will facilitate the identification of functionally important positions and the functional annotation of novel proteins. CAB-align will also allow us to discover novel evolutionary relationships at the residue level. The consistent multiple alignments obtained by CABalign will help to identify structurally conserved regions as well as improving template-based modeling methods based on multiple templates. At present, there are many types of structural classification databases for single proteins (e.g., SCOP [39], SCOPe [33], CATH [5], PDBeFold [2], and FSSP [3,40]) or protein-protein complexes (PDBePISA [41]). The good classification performance of CAB-align is necessary for classifying new protein structures. The newly classified data will also contain previously unknown structural and evolutionary relationships.
The stand-alone software CAB-align and lists of the benchmark datasets are freely available to academic users at http://www.pharm.kitasato-u.ac.jp/bmd/bmd/Publications.html

Surface Representation for Each Amino Acid Residue
To calculate the residue-residue contact area matrix, we used a modified smooth surface model: the simple piecewise quadratic meatball algorithm. For the surface of the kth amino acid residue, the shape of the surface is defined by the points x, which satisfy the following equations: where N k is the number of atoms in the kth amino acid residue, g i is the center of an atom i, G() is a density function, r i is a van der Waals radius value of atom i, and c is a density coefficient that controls the degree of smoothness. In this study, we set c to 1.5, so the distances between any smoothed surfaces and the center of atoms were less than 1.5×(the van der Waals radius). We used Marching Cubes triangulation to define the triangle surface, vertices, and edges. The voxel size was set to 1.0 Å. The extracted surface was used to calculate the residue-residue contact area, where the distance between the surfaces of different residues was less than 2.8 Å. In this study, we defined an inter-residue contact area between the lth and mth residues as the sum of the surface area for the interactions between the mth and lth residues on each residue (Fig 14). Our smooth surface model was used to approximate the solvent excluded surface (SES), which is defined by a spherical probe with a radius of 1.4 Å rolling over all the atom spheres. SES is used widely to represent the surface of a molecule, and it can be calculated by various algorithms [42,43]. Our smooth surface model correlated highly with the SES values derived from MSROLL [43]. The average deviation was 8.3 Å 2 (6.4%), and the linear correlation coefficient was determined as 0.995 when taking 49,458 amino acid residues from a random selection of 200 protein structures.

Generation of the Initial Alignments
To generate various initial alignments, we employed the Smith-Waterman DP algorithm [44] according to the local and global structural similarities (Steps 1 and 2 in Fig 8). The local structural similarities were generated based on the unit-vector root mean square (URMS) distance between all pairs of heptapeptides, as described in the MAMMOTH algorithm [9], whereas the global structural similarities were obtained using a combination of MAMMOTH and the TMalign algorithm.
First, we found 290 initial alignments based on the local structural similarity with various gap open and extension penalties. According to the MAMMOTH algorithm, the score matrix used in the DP phase is defined as follows: DðURMS R À URMS l;m Þ ¼ 10; URMS R > URMS l;m 0; otherwise ; ð10Þ ( where URMS R is the expected minimum URMS between two random heptapeptides, which we set to 0.917. The affine gap penalty is defined as g(k) = α + β(k − 1), where k is the number of gaps, and α and β denote gap open and extension penalties, respectively. We used 11 gap open penalties ranging from 0 to 50 (step size = 5.0) and six extension penalties ranging from 0 to 10 (step size = 2.0). In total, we employed 58 combinations of α and β in the DP phase, where α ! β. In our study, the optimal alignments based on local similarity were not sufficient to generate the best alignment with the highest residue-residue contact area similarity. Thus, we also generated four suboptimal alignments [45] for each DP. The suboptimal alignments were generated by iteratively updating the similarity score matrix S(k, l). During each iteration, S(k, l) for the previously aligned positions was decreased by 10%, and the forward trace DP matrix was then updated and thus a new alignment was generated, which was changed slightly compared with the previously computed alignment. We obtained more than 290 initial alignments based on the global structural similarity. These 290 alignments were then re-aligned to maximize the TM-score based on a heuristic iteration described in TM-align [10]. In this procedure, the gap open and extension penalties were set to 0.6 and 0, respectively. As result, 580 initial alignments were obtained from DP based on the local/global structural similarities. Iterative DP with a Contact Area Similarity Score Following the removal of redundancies from the 580 initial alignments, the remaining alignments were subjected to iterative Smith-Waterman DP based on the residue-residue contact area matrix (Step 3 in Fig 8). To align the residue-residue contact area matrix for proteins A and B, we defined the simple similarity score as follows: where i and j denote a pair of aligned residues from A and B as i = (i A , i B ) and j = (j A , j B ), L is the length of the alignment, and θ is the similarity measure for a residue pair using the residue- residue contact area. For θ, we modified the CAD-score by considering two points: (1) strong overprediction of the contact is better than entirely missing the contact and (2) the similarity measure is a symmetrical function based on proteins A and B. θ is defined as follows: where a A i A ;j A and a B i B ;j B are the residue-residue contact areas between aligned positions i and j of A and B, respectively, and w i,j is a weight function used to control the overcounting of nearby contact when the sequence separation is less than five.
DP was applied to the similarity matrix to find an optimal alignment in a heuristic manner, which was calculated from the given initial alignment and the residue-residue contact area matrix. The preliminary similarity matrix M for all pairs of residues between proteins A and B is calculated as follows: where M(k) denotes the similarity between residues k A in protein A and k B in protein B when the other aligned positions i = (i A , i B ) are not changed for the given alignment ( Fig 15). As shown in Fig 15B and Eq 14, the conflicted pairs are ignored when (k A − i A )(k B − i B ) 0. However, M is too restrictive to improve the alignment by DP and only the local optimal alignments were obtained in our study. Therefore, M is converted into M' by considering  neighboring residues using a sliding window. M' is defined as follows: where N A and N B are the number of residues in protein A and B, respectively, and N w defines the size of the window. Thus, 2N w + 1 corresponds to the window size. M' approximates the similarity between the two fragments, i.e., N w residues around k A and k B (Fig 16). Starting from the given initial alignment, the heuristic method (iterative DP) was applied to obtain an optimal alignment. Step 3 in Figs 8 and 17 summarizes the iterative DP. First, the window size was set to 11 (i.e.,., N w = 5), and M' was then calculated from the given alignment In the matrix M 0 , the similarity score for the residue pair (k a ,k b ) is calculated from the two orange cells (M(k a -1,k b -1) and M(k a +1,k b +1)) and the red cell M(k a ,k b ). (B) Similarity score for the residue pair (k a -1,k b -1). (C) Similarity score for the residue pair (k a ,k b ). (D) Similarity score for the residue pair (k a +1,k b +1). The black cells represent the aligned positions in the given alignment. The gray cells represent the ignored pairs. doi:10.1371/journal.pone.0141440.g016 (Fig 17A and 17B). DP was performed and after each round of DP, the similarity matrix M' was updated based on the alignment obtained ( Fig 17B). The iterative DP was repeated until the similarity score S converged, or 10 iterations were reached. After this set of iterations, the size of the window was reduced, and iterative DP was then repeated again. The size of the window was reduced gradually for N w = {5, 3, 1} (Fig 8). Finally, the best alignment was obtained with the highest S value. The parameters γ (Eq 13) and the gap open penalty in iterative DP were optimized based on a training dataset (a subset of the SCOPe_FAMILY) by maximizing the average AQ(5). The training dataset contained 4,144 protein pairs with a significant evolutionary relationship (E-value 10 −10 ) and a reliable alignment length (!100). Moreover, to confirm the evolutionary relationship, both proteins in the pair had to belong to the same superfamily. Our parameter optimization indicated that the optimal parameter was γ = 0.5, and the gap open penalty in iterative DP was 90.