Non-sequential protein structure alignment by conformational space annealing and local refinement

Protein structure alignment is an important tool for studying evolutionary biology and protein modeling. A tool which intensively searches for the globally optimal non-sequential alignments is rarely found. We propose ALIGN-CSA which shows improvement in scores, such as DALI-score, SP-score, SO-score and TM-score over the benchmark set including 286 cases. We performed benchmarking of existing popular alignment scoring functions, where the dependence of the search algorithm was effectively eliminated by using ALIGN-CSA. For the benchmarking, we set the minimum block size to 4 to prevent much fragmented alignments where the biological relevance of small alignment blocks is hard to interpret. With this condition, globally optimal alignments were searched by ALIGN-CSA using the four scoring functions listed above, and TM-score is found to be the most effective in generating alignments with longer match lengths and smaller RMSD values. However, DALI-score is the most effective in generating alignments similar to the manually curated reference alignments, which implies that DALI-score is more biologically relevant score. Due to the high demand on computational resources of ALIGN-CSA, we also propose a relatively fast local refinement method, which can control the minimum block size and whether to allow the reverse alignment. ALIGN-CSA can be used to obtain much improved alignment at the cost of relatively more extensive computation. For faster alignment, we propose a refinement protocol that improves the score of a given alignment obtained by various external tools. All programs are available from http://lee.kias.re.kr.

: A schematic diagram to represent two alignments s 1 and s 2 and the distance between them. An alignment can be represented in two ways, (a) wrt. protein A and (b) wrt. protein B. To calculate the distance between s 1 and s 2 , for each representation, the index difference between two aligned residues aligned to the identical partner is shown in the last row. The sum of these differences is defined as the distance between s 1 and s 2 .
In the example shown here, the distance between s 1 and s 2 is 8+9=17.

Definition of distance between two alignments
The distance between two given alignments is defined as the sum of the solution difference calculated in two ways as shown in Figure A. To measure the distance between s 1 and s 2 , first, we consider all residue indices of protein A that are aligned to B in both alignments. The index difference of a protein B's residue aligned to the identical residue of protein A is calculated, as shown in the last row of Figure Aa. Similarly, the index difference of a protein A's residue aligned to the identical residue of protein B is calculated (see the last row of Figure Ab). The difference is only calculated when a residue is aligned. The sum of these differences is defined as the distance between s 1 and s 2 .

Local optimizer
For local optimization of a given alignment, we used the quenching approach by trying various local perturbation moves and accepting only score-improving ones [2,1].
We used three-step optimization as follows. In the first step, we used point extension, point deletion, and block extension ( Figure B) in a random order. For point extension, an aligned pair is added to one end of an existing aligned block. For point deletion, an aligned pair is deleted as long as it does not violate the minimum block-size constraint.
We note that a point deletion can split an aligned block into two as long as it does not violate the minimum block-size constraint. Block extension is similar to the point extension except that 2 or 3 contiguous residue pairs are extended simultaneously. When no improvement was observed after a complete round of all possible moves, we proceeded to the next step.
protein are aligned to residues (22, 23, 24, 25) of the other protein forming a block, the first ones can be shifted to either (9, 10, 11, 12) or (11,12,13,14). Similarly, the second ones can be shifted to either (21, 22, 23, 24) or (23, 24, 25, 26). The maximum shift size of the block rolling was set to 8. When a move was accepted, the alignment was sent back to the first step described in the previous paragraph. Otherwise, we proceeded to the final step.
The final step performed block deletion followed by block addition. Block deletion considers deleting up to 7 contiguously aligned residue pairs as long as it does not violate the minimum block-size constraint. After a successful block deletion, the alignment was further tried with block addition, after which the alignment was sent back to the first step. Block addition adds a block of minimum-size contiguous residue pairs ( Figure B). In the block addition, RMSD between two proteins was measured using the current alignment, and only those blocks in which all distances of the Cα pairs are within 220% of the RMSD value are tried. The block addition was tried up to 10,000 times. The quenching stopped when no improvement was observed.

Generation of a daughter solution
We consider two parent alignments p 1 and p 2 , where the alignment is represented as a set of aligned residue pairs.
Initially, the daughter alignment d is set as d = p 1 ∩ p 2 . In d, all aligned blocks are examined to satisfy the minimum block size constraint, and all under-sized blocks are extended using remaining residue pairs in R = p 1 ∪ p 2 − d. To prevent potential double occupancy of a residue in d, whenever a residue pair t is moved from R to d, we remove residue pairs in R that contain a residue index of t, and the remaining set is denoted as R .
We select a residue pair in R , t, in a random fashion. If the addition of t to d cannot satisfy the minimum block size constraint even after any follow-up operations, t is discarded. Otherwise, t is either discarded or moved to d with the equal probability. If necessary, additional residue pairs in R are moved together with t to d, to satisfy the minimum block size constraint. This procedure is repeated until R becomes a null set. If d is identical to one of its two parent alignments, it is discarded and the whole procedure is repeated up to 10 times.

CSA Procedure
With the three CSA ingredients described above, the overall procedure of CSA is carried out as follows. To start the first round of CSA, 50 alignment solutions are randomly generated. Specifically, 4-8 residue segments from one protein are paired by the same size segments from the other protein to form blocks of aligned pairs (forwardly or reversely). The total number of aligned residue pairs are set between 40 to 80 % of the smaller protein. All of the thus-generated alignments are subject to the local optimizer to constitute the first population of CSA, called first bank.
We make a copy of first bank and call it bank. The solutions in bank are updated by better ones found during the course of optimization, while first bank is kept unchanged. The initial value of D cut is set to one half of the average distance among first bank solutions.
A total of 30 solutions are selected from bank as seeds. For each seed, 30 daughter solutions are generated by performing multiple crossover procedures as described above. Nine daughter solutions are generated by crossover between the seed and a randomly selected bank solution. In the early stage of CSA, represented by the number of unused seeds being greater than 47, three of the nine crossover operations are performed between two randomly selected first bank solutions. Three daughter solutions are generated by crossover between the seed and a randomly selected first bank solution.
Another eighteen daughter solutions are generated by crossover between two preprocessed alignments starting from Another 50 randomly generated and subsequently optimized solutions are added to both bank and first bank increasing the sizes of bank and first bank to 100. We repeat the whole three round procedure of aforementioned iterations.
The value of D cut is reset to one half of the average distance among solutions in first bank. In the first round of CSA iterations, the 50 newly added solutions in bank evolves in the same way as described above. But, seeds are taken only from the newly added 50 solutions and crossover is performed only between these newly added solutions. These two restrictions are released after the first round. The whole CSA procedure is completed after two additional rounds are carried out considering all 100 solutions. One may consider repeatedly adding additional 50 solutions to bank for a more thorough optimization at the expense of more computational resources.     Table B: Alignment results of 1uok-1vjs by various alignment methods. Precision and recall are calculated based on the referential human-curated alignment. ALIGN-CSA used DALI-score as the score function. In this examplary case, the global optimization produces an alignment more similar to the referential alignment than CLICK, SPalignNS and TMalign. The precision and recall of the MICAN alignments probably because MICAN program is already trained to reproduce the referential alignments of HOMSTRAD set including this particular case. If the scoring function is finely tuned to be used in finding biologically relevant alignments, the usability of the global alignment may increase. ing DALI-score, SO-score, SP-score, and TM-score. The block size were restricted to either 1 (BS=1) or 4 (BS=4). The bars show the average scores of the alignments measured from the three test sets.