Symmetric Key Structural Residues in Symmetric Proteins with Beta-Trefoil Fold

To understand how the symmetric structures of many proteins are formed from asymmetric sequences, the proteins with two repeated beta-trefoil domains in the Plant Cytotoxin B-chain family and all presently known beta-trefoil proteins are analyzed by structure-based multi-sequence alignments. The results show that all these proteins have similar key structural residues that are distributed symmetrically in their structures. These symmetric key structural residues are further analyzed in terms of inter-residue interaction numbers and B-factors. They can be distinguished from other residues and show a significant propensity to form the structural framework. This indicates that these key structural residues may direct the formation of symmetric structures even though the sequences are asymmetric.


Introduction
Symmetric proteins [1] are ideal objects for investigating protein evolution and folding. It is generally accepted that symmetric proteins have arisen from gene duplications and fusions [2,3]. However, these repetitive or symmetric signals have been almost lost from their sequences during evolution while remaining in their structures. Investigating how these proteins keep their symmetric structures with "asymmetric" sequences is one way to understand protein evolution and folding. Understanding the building principle of symmetric proteins is also necessary for designing de novo proteins, because symmetric structures are relatively simple to build from basic units. One solution to the problem above is that protein sequences may contain hidden symmetric signals that determine their symmetric structures [4][5][6][7][8].
Multi-domain proteins provide ideal models for studying this problem, since many of them consist of more than one domain evolved from the same ancestor and have similar structural symmetry but different sequence symmetry. For example, Ricin Toxin B (RTB, PDB id: 2aaib) is composed of two domains with the same beta-trefoil structure of three-fold symmetry [16][17][18]. It was speculated that RTB arose by triplicate duplication, occurring twice, of its ancestor, a galactose-binding peptide of about forty residues [18]. Rutenber et al. detected hidden three-fold sequence symmetry in both domains [18], but the degrees are very different. In the first domain the averaged sequence similarity index between the trefoil units equals 1.73, while in the second domain it is 2.63, i.e., about one half larger than that of the first domain.
This appears to contradict their almost identical structures. Since these two domains have evolved from the same ancestor, they are an ideal model for understanding the sequence-structure relations of proteins. In fact, for RTB, Haze detected a three-fold repetitive QXW motif in both domains and regarded its residues as key structural residues [19]. Rutenber and Robertus also described a 12-residue hydrophobic core in both domains [20], and later Murzin et al. further showed that these residues are characteristic of the beta-trefoil fold [17]. It seems that these key residues may be the main factor determining the symmetric structure. However, more evidence is needed to validate this conclusion. At the least, we need to investigate other proteins in the same family.
According to the Structural Classification of Proteins (SCOP) database [21], RTB belongs to the Plant Cytotoxin B-chain (PCB) family, and all proteins in this family contain two domains with beta-trefoil structure (see Materials and Methods). In this paper we analyze their sequence symmetries and identify their key structural residues by three different methods: structure-based multi-sequence alignments, residue interaction numbers and B-factor analysis. We also extend our analysis to all presently known beta-trefoil proteins. Our results show that similar key structural residues exist in all these proteins and may determine the symmetry of their structures.

Plant Cytotoxin B-chain Family
According to SCOP 1.69, there are five species and sixteen protein chains in the PCB family (Table 1). Among them, two species, European mistletoe and Sambucus ebulus, have more than one protein chain. We select 1m2tb and 1hwmb as their representatives because both have crystal structures of the highest experimental resolution (Table 1) [22]. The atomic coordinates of the crystal structures (PDB files) and the experimental resolutions are retrieved from the Protein Data Bank (Table 1).

Detection and Quantification of Protein Sequence Symmetry
In a previous paper [12], we developed a modified recurrence plot (MRP) algorithm to detect protein sequence symmetry, and defined two parameters R and S to quantify the degree of the detected sequence symmetry. Here, we only introduce them briefly.
The MRP of a protein sequence x_1 x_2 x_3 … x_N is built as follows: the horizontal axis i denotes the location of the first residue of a segment in the sequence and the vertical axis d denotes the length of the segment. For any segment X_i = x_i x_{i+1} … x_{i+d−1}, if the number of its non-overlapping similar segments X_j = x_j x_{j+1} … x_{j+d−1} (|j − i| ≥ d) is larger than the degree of symmetry sought, we plot a point at (i, d). The MRP is formed when this is done for all possible i and d. Two segments are considered similar if the percentage of their similar residues, obtained by pairwise global sequence alignment with the PAM250 score matrix, is larger than a chosen number r and the p-value is lower than 0.05.
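The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of ref. 12: the similarity test is replaced by a plain fraction of identical positions (no PAM250 alignment and no p-value filter), indices are 0-based, and the names similar, modified_recurrence_plot, threshold and min_len are hypothetical.

```python
def similar(a, b, r):
    """Simplified similarity test for two equal-length segments.

    Assumption: the fraction of identical positions stands in for the
    PAM250 global alignment and p-value filter described in the text.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a) > r


def modified_recurrence_plot(seq, r=0.3, threshold=1, min_len=2):
    """Return the set of (i, d) points of a simplified MRP.

    i is the 0-based start of a segment and d its length. A point is
    plotted when the number of similar segments X_j with |j - i| >= d
    is larger than `threshold` (a hypothetical parameter playing the
    role of the "degree of symmetry sought").
    """
    n = len(seq)
    points = set()
    for d in range(min_len, n // 2 + 1):
        for i in range(n - d + 1):
            segment = seq[i:i + d]
            count = 0
            for j in range(n - d + 1):
                # Only segments that do not overlap X_i are compared.
                if abs(j - i) >= d and similar(segment, seq[j:j + d], r):
                    count += 1
            if count > threshold:
                points.add((i, d))
    return points


if __name__ == "__main__":
    # Toy sequence with an exact three-fold repeat of "ABCD".
    print(sorted(modified_recurrence_plot("ABCDABCDABCD", r=0.9)))
```

With threshold=1 a segment needs at least two non-overlapping similar copies, which corresponds to searching for a three-fold repeat in this toy setting.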
The parameter R is the Pearson correlation coefficient between iMRP and rMRP, where iMRP denotes the ideal symmetric MRP corresponding to the real MRP (rMRP) of the protein sequence. R reports the presence of non-overlapping repetitive patterns. Because the R value alone cannot tell us how similar the different patterns are, and hence cannot give the degree of sequence symmetry, we introduce a second parameter S. S is the average of the Pearson correlation coefficients between all pairs of different patterns and describes their average similarity. The S value is therefore a measure of the degree of sequence symmetry. For a sequence to be symmetric, both R and S should have large values. The details of this method can be found in ref. 12. Note that other methods also exist for finding repeats in a protein sequence [4][5][6][7][8].
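As a rough illustration of the two parameters, the sketch below computes R as the Pearson correlation between two plots and S as the mean pairwise Pearson correlation between patterns. How the ideal MRP is built and how a pattern is encoded are defined in ref. 12 and are not reproduced here; the sketch simply assumes that both plots are equal-shaped numeric arrays and that each pattern is an equal-length numeric vector. The function names are hypothetical.

```python
import itertools

import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient of two equal-sized arrays.

    The result is NaN if either array is constant.
    """
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    return float(np.corrcoef(a, b)[0, 1])


def r_value(ideal_mrp, real_mrp):
    """R: correlation between the ideal and the real MRP (assumed to be
    arrays of the same shape, e.g. 0/1 occupancy grids)."""
    return pearson(ideal_mrp, real_mrp)


def s_value(patterns):
    """S: average Pearson correlation over all pairs of patterns
    (assumed here to be equal-length numeric vectors)."""
    patterns = list(patterns)
    if len(patterns) < 2:
        raise ValueError("at least two patterns are needed")
    pairs = itertools.combinations(patterns, 2)
    return float(np.mean([pearson(p, q) for p, q in pairs]))
```

Identical plots give R = 1, and identical or linearly related patterns give S = 1; anti-correlated patterns pull S towards −1.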

Evaluation of Residue Interactions
The residue interaction number (RIN) of a residue is the number of interaction pairs between this residue and other residues that are more than four residues apart along the sequence and whose potential energies are lower than −0.5 kcal/mol [23,24]. The potential energy is calculated with an all-atom force field and an implicit solvent model (GB/SA) [25,26]. It is the sum of three energy terms: van der Waals energy, electrostatic energy and solvent polarization energy. The third term denotes the electrostatic interactions ΔG_pol between the solute and the solvent and is calculated with the generalized Born expression of the GB/SA model, in which D_ij = r_ij^2 / (4 a_i a_j) and r_ij is the distance between atom i and atom j. q_i and q_j are the charges of atom i and atom j, ε is the dielectric constant of the solvent, and a_i is the effective Born radius of atom i, which is related to the effective Born free energy of solvation. The molecular mechanics software we used is Tinker with the CHARMM27 force field [27,28]. Before the formal calculations we optimize the protein structure by the conjugate-gradient method with a gradient tolerance of 0.1 kcal/(Å·mol).
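Once pairwise residue energies are available, the RIN definition reduces to a simple count. The sketch below assumes the energies have already been computed (for example with Tinker) and stored in a symmetric N × N array in kcal/mol; the function and parameter names are hypothetical, and the energy calculation itself is not reproduced.

```python
import numpy as np


def residue_interaction_numbers(energy, min_separation=5, cutoff=-0.5):
    """Count, for each residue, its interaction pairs.

    energy: symmetric (N, N) array of pairwise residue interaction
        energies in kcal/mol (assumed to be precomputed).
    min_separation: smallest allowed |i - j|; 5 encodes "more than
        four residues apart along the sequence".
    cutoff: a pair counts only if its energy is below this value.
    """
    energy = np.asarray(energy, dtype=float)
    idx = np.arange(energy.shape[0])
    separation = np.abs(idx[:, None] - idx[None, :])
    interacting = (separation >= min_separation) & (energy < cutoff)
    return interacting.sum(axis=1)
```

Averaging the returned counts over all residues, over buried residues, or over the residues of a motif gives the kind of mean RIN values compared in the Results.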

Results and Discussions
Three-fold sequence symmetries of different degrees
Fig. 1 gives the MRPs of the two domains of the five representative protein chains (r = 0.3 as in the previous paper [12]). It shows that all MRPs contain three repetitive patterns. The R values of all domains are larger than 0.5, and all the S values are larger than 0.4 with only one exception (Table 2). In our previous work, R ≥ 0.5 and S ≥ 0.4 were set as the cutoff values for deciding whether an MRP shows symmetry [12]. Thus, almost all domains show hidden three-fold sequence symmetries. However, the MRPs of all the second domains reveal a pattern of three approximately right-angled triangles, and this pattern is much more distinct than in the first domains (Fig. 1). This means that the degree of symmetry of the second domains is higher than that of the first domains. In agreement with this, the R and S values of the second domains are all larger than those of the first domains with only one exception (Table 2), and the differences of the S values are significant, equaling 0.18, 0.10, 0.30, 0.22 and 0.18, respectively, or about 35.3%, 22.7%, 54.6%, 34.4% and 34.6% of their respective means. This is in agreement with the result for RTB [18].
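The cutoff test and the "difference as a percentage of the mean" used above can be written out explicitly. In the sketch below the function names are hypothetical, and the two S values in the example are not taken from Table 2: they are illustrative numbers chosen only so that the difference and the percentage match the first quoted pair (0.18 and about 35.3%).

```python
def shows_symmetry(r, s, r_cut=0.5, s_cut=0.4):
    """Apply the cutoffs R >= 0.5 and S >= 0.4 quoted in the text."""
    return r >= r_cut and s >= s_cut


def relative_difference(s_first, s_second):
    """Return (difference, difference as a percentage of the mean)."""
    diff = s_second - s_first
    mean = (s_first + s_second) / 2.0
    return diff, 100.0 * diff / mean


if __name__ == "__main__":
    # Illustrative S values only (assumption, not data from Table 2).
    diff, percent = relative_difference(0.42, 0.60)
    print(round(diff, 2), round(percent, 1))
```

The percentage is taken relative to the mean of the two S values, so it is symmetric in the two domains apart from its sign.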
For the five representative proteins, the first domains are superposed on their second domains with the aid of OPAAS [29], and the root-mean-square distances (RMSD) are all less than 2 Å (Table 1), i.e., the first and second domains have similar structures. Therefore, the symmetry degrees of the first and second domains are the same at the structural level but different at the sequence level. This is also in agreement with the result for RTB [18].
Key structural residues of three-fold repetitions
Structure-based multi-sequence alignments. In the first and second domains of all five representative protein chains of the PCB family, we identified four repetitive motifs through structure-based multi-sequence alignments of trefoil units (Fig. 2) [30,31]. [...] 3, where X denotes any residue. In total they comprise twenty-four residues and show three-fold repetitions (Fig. 3). The four different residues (I, L, M, V) are all large hydrophobic residues [32,33]. Generally, a residue is considered buried if it has less than 25% solvent accessibility [34]. Using WHAT IF [35], we find that the residues of the four three-fold repetitive motifs are almost all buried in the interior of the structures.
We take RTB as an example to show the four three-fold repetitive (FTR) motifs in detail. The distribution of these motifs in the structure is illustrated in Fig. 3. Each beta strand has one motif and each trefoil unit has four motifs. The three-fold repetitions of the four motifs correspond exactly to the three trefoil units in both domains. Moreover, these motifs are distributed symmetrically in the three-dimensional structures.
The first motif is located at the top of the barrel structure, the fourth in the middle and the remaining two at the bottom. The FTR motifs seem to form the framework of the structures and to act as key residues contributing to the formation of the symmetric structures, namely, the so-called key structural residues. Three previous works have reported some key structural residues in RTB [17,19,20]. Comparing them with the FTR motifs, we find a large overlap. Since the other four representative protein chains show the same FTR motifs, these motifs can be considered the key structural residues of the PCB family.
Inter-residue interactions. We use another approach to confirm that the FTR motifs act as key structural residues in the PCB family: we calculate their inter-residue interactions. The key structural residues should have more interactions with other residues. RTB is again taken as the example. The average residue interaction numbers (RIN) of all residues, buried residues, and all residues in FTR motifs are 4.98, 6.31 and 8.50, respectively (Table 3). The average RIN of the FTR motifs is the largest among them (Table 4). The FTR motifs are mainly composed of buried residues, and a buried residue is generally likely to have a large RIN. However, the average RIN of the FTR motifs is larger than that of the other buried residues. This indicates that they may play the role of key structural residues. Furthermore, as shown in the plot of the RIN versus amino acids, the residues in the FTR motifs almost always have the locally largest RINs, although they may not be the globally largest (Fig. 4A). For the other four representative protein chains, the results are similar (Table 3 and Fig. 4). Hence, it is a common feature that the residues of the FTR motifs have larger RINs and play the role of hubs in the inter-residue interaction network. Fig. 5 gives the interaction energies between the key structural residues of each representative protein chain. In each plot there are six "L"-like patterns along the diagonal (each domain has three patterns), which denote strong residue interactions. There are few interactions between different trefoil units. We compared these patterns with the positions of the key structural residues and found that the six "L"-like patterns correspond exactly to the six repetitions of the four motifs, i.e., to the six trefoil units. Furthermore, the "L"-like patterns indicate similar inter-residue interaction patterns in every trefoil unit. Therefore, the trefoil units not only have similar key structural residues but also similar strong residue interactions.
This suggests that the repetitive key structural residues may determine the three-fold trefoil units. Finally, the "L"-like patterns show that the second motifs, (L/M/V) 3 , have stronger interactions with the other motifs. This may be because the second motifs are closer to the other three motifs (Fig. 3).
B-factors. From an experimental point of view, since the key structural residues act as the skeleton of the structures, they should be much more constrained than other residues. The B-factors retrieved from the PDB file are generally characteristic of the degree of atomic constraint. We average the B-factors of all heavy atoms in a residue and designate the mean as the B-factor of this residue. For RTB, the average B-factors of all residues, buried residues, and all residues in the FTR motifs are 25.35, 22.73 and 22.20, respectively (Table 3). The FTR motifs thus have the smallest average B-factor. Furthermore, as shown in the plot of the B-factors versus amino acids, the residues in the FTR motifs always have the locally smallest B-factors (Fig. 4A). For the other four representative protein chains, we obtain the same results as for RTB (Table 3 and Fig. 4). Therefore, the FTR motifs seem to be the most strongly constrained. In summary, both the inter-residue interactions and the B-factors also suggest that the FTR motifs may be key structural residues in the PCB family.
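The per-residue B-factor described above (the mean over all heavy atoms of a residue) can be obtained directly from the fixed-column ATOM records of a PDB file. The sketch below is a minimal reading of that definition, with assumptions of its own: only ATOM records are used, hydrogen and deuterium atoms are skipped, alternate locations are not treated specially, and the function name is hypothetical.

```python
from collections import defaultdict


def residue_b_factors(pdb_lines, chain=None):
    """Mean B-factor over the heavy atoms of each residue.

    pdb_lines: iterable of PDB-format lines.
    chain: optional chain identifier used to restrict the average.
    Returns a dict keyed by (chain id, residue number, insertion code).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Element symbol in columns 77-78; fall back on the atom name.
        element = line[76:78].strip().upper()
        if not element:
            element = line[12:16].strip().lstrip("0123456789")[:1].upper()
        if element in ("H", "D"):
            continue  # keep heavy atoms only
        chain_id = line[21]
        if chain is not None and chain_id != chain:
            continue
        key = (chain_id, int(line[22:26]), line[26].strip())
        sums[key] += float(line[60:66])  # B-factor, columns 61-66
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}
```

Averaging the returned values over all residues, over buried residues, or over the residues of the FTR motifs gives the kind of group means compared in the text.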

Extension to all beta-trefoil folds
Are the three-fold repetitive key structural residues specific to the beta-trefoil proteins of the PCB family, or common to all proteins sharing the beta-trefoil fold? In our recently published paper [12], thirty protein chains/domains were selected as representatives of the presently known proteins with beta-trefoil fold. Because the two domains of 1vcla are homologous, and because only the atomic coordinates of the alpha carbon atoms can be retrieved from the PDB database for 2ila-, twenty-eight protein chains/domains are used as the representatives (Table S1 in Supporting file S1). Two algorithms, CE and TM-align, integrated in STRAP [36][37][38], are used to carry out their structure-based multiple sequence alignments. Interestingly, both alignment methods detected twelve similar conserved motifs (Figure S1 and Figure S2 in Supporting file S1). We compare them with the FTR motifs and find that they are similar. The twelve conserved motifs also show three-fold repetitions. In addition, we notice that the twelve conserved residues, like the FTR motifs, are mainly composed of large hydrophobic residues (I, L, V, F, W), which is in agreement with the earlier prediction by Murzin et al. that large hydrophobic residues stabilize the beta-trefoil fold [17]. Recently, Chaudhuri et al. [39] pointed out that at least 80% of propellers across families are similar at a level indicative of homology. One piece of evidence supporting their conclusion is that all propellers share similar key sequence motifs across families. We [23,24] also studied the key residues in the protein domain G from transducin (PDB id: 1tbg), which is a propeller-like protein composed of seven similar blades, also called WD-repeats, and has a high structural symmetry. From a structure-based sequence alignment, it can be observed that five residues are almost totally invariant in each repeat of the protein.
These structurally conserved residues connect the outer strand of each blade to the inner three strands of the next blade, and are considered key residues critical for the structural stability of the G protein. We calculated the contact energies with an all-atom force field and found that the residues with the lowest contact energies (i.e., strong inter-residue interactions) are in good agreement with the structurally conserved residues identified previously. The proteins with beta-trefoil fold show a similar situation. All the evidence suggests that the three-fold repetition of key structural residues should dominate the three-fold symmetric structures. Thus, the apparent contradiction between the different degrees of structure and sequence symmetry of the two domains of PCB family proteins can be interpreted in terms of similar key structural residues.
In conclusion, we analyzed the proteins with two repeated beta-trefoil domains in the Plant Cytotoxin B-chain family and all presently known beta-trefoil proteins by three different methods, and showed that some key structural residues may play important roles in the formation of the three-fold symmetric structure of the beta-trefoil fold.
These key structural residues are (i) buried, (ii) symmetrically located in the structure, and (iii) characterized by large residue interaction numbers and small B-factors. This result may be helpful for designing de novo proteins.