Prediction of Metal Ion–Binding Sites in Proteins Using the Fragment Transformation Method

The structure of a protein determines its function and its interactions with other factors. Regions of proteins that interact with ligands, substrates, and/or other proteins, tend to be conserved both in sequence and structure, and the residues involved are usually in close spatial proximity. More than 70,000 protein structures are currently found in the Protein Data Bank, and approximately one-third contain metal ions essential for function. Identifying and characterizing metal ion–binding sites experimentally is time-consuming and costly. Many computational methods have been developed to identify metal ion–binding sites, and most use only sequence information. For the work reported herein, we developed a method that uses sequence and structural information to predict the residues in metal ion–binding sites. Six types of metal ion–binding templates– those involving Ca2+, Cu2+, Fe3+, Mg2+, Mn2+, and Zn2+–were constructed using the residues within 3.5 Å of the center of the metal ion. Using the fragment transformation method, we then compared known metal ion–binding sites with the templates to assess the accuracy of our method. Our method achieved an overall 94.6 % accuracy with a true positive rate of 60.5 % at a 5 % false positive rate and therefore constitutes a significant improvement in metal-binding site prediction.


Introduction
The structure of a protein determines its function and its interaction(s) with other components, e.g., other proteins and cofactors, including metal ions. Approximately one-third of all proteins bind at least one metal ion [1,2,3], and many different types of metal ion-binding proteins are found in humans [4,5]. Metal ions help stabilize protein structure, may induce a conformational change upon binding, and/or participate in catalysis. Metal ions found in proteins include those of the alkali metals, alkaline earth metals and transition metals, with the most common being sodium and potassium ions, calcium and magnesium ions, and iron, manganese, copper and zinc ions, respectively. For the metal ion-binding proteins found in the Protein Data Bank (PDB http://www.rcsb.org/pdb/), ,66 % contain transition metal ions, ,37 % contain alkaline earth metal ions, and ,6 % contain alkali metal ions [6].
In humans, hemoglobin transports oxygen in the blood from the lungs to peripheral tissues. Hemoglobin contains four heme groups that reversibly bind Fe 2+ . Fe 2+ coordinates four heme nitrogens and, reversibly, one oxygen. In the absence of an oxygen, a water molecule is bound. Urease, expressed by the Gram-negative microaerophilic bacterium Helicobacter pylori, requires Ni 2+ for its function. Urease hydrolyses urea into carbon dioxide and ammonia to produce an alkaline environment that protects the bacterium from acidic gastric juice during its infection of the stomach. Thus, in both prokaryotes and eukaryotes, metal ionbinding proteins are extensively involved in many different biochemical reactions. Identifying metal ion-binding sites is, therefore, key to understanding the functional mechanisms of metal ion-binding proteins.
Experimentally, metal ion-binding proteins are identified and/ or characterized using nuclear magnetic resonance spectroscopy [7], gel electrophoresis [8], metal-affinity column chromatography [9], electrophoretic mobility shift assay [9], absorbance spectroscopy [10], and mass spectrometry [8]. Most of these methods require complex steps and specialized equipment, making them unsuitable for unknown targets. There is considerable demand, therefore, for other ways to identify metal ion-binding sites. Computational methods have been used to identify metal ionbinding sites, e.g., support vector machines [6,11,12], neural networks [6,13], the FoldX force field [14], the CHED algorithm [15,16], graph theory and geometry algorithms [17,18]. Some of these methods use only sequence information [6,11,12], whereas others use both sequence and structure information [17,18]. However these previous attempts to predict metal ion-binding sites have often had low sensitivities; clearly, predictive accuracy must be improved.
On average, the members of the Structural Genomics Initiative solve 20 new protein structures each week. Currently, the PDB contains more than 70,000 protein structures. In general the regions in proteins that interacts with ligands, substrates, or other proteins tends to be structurally conserved [19] and the residues involved are in close spatial proximity even though they may be distant in sequence. Such residues constitute , 10-30 % of a protein sequence [20,21,22]. The residues that most often bind metal ions are CYS, HIS, GLU and ASP [23,24] because the atoms of their polar or charged side chains can coordinate metal ions. For the work reported herein, we used the fragment transformation method [25] to identify residues in proteins that bind Ca 2+ , Cu 2+ , Fe 3+ , Mg 2+ , Mn 2+ , or Zn 2+ . This method combines sequence and structural information contained within spatially local fragments. Given that the three-dimensional structure and residue type are often conserved, similar binding regions can be found by comparing the types of residues and their relative locations with those of computationally constructed metal ion-binding residue templates.

Overview
First, the structures of known metal ion-binding proteins were extracted from the PDB. Next, a database containing metal ionbinding sites templates was constructed. Each template included residues at least partially within 3.5 Å of the metal ion center. The structure of the protein being queried for a metal ion-binding site (query protein) was then compared with each template using a ''leave-one-out'' comparison method. The fragment transformation method [25] attempts to structurally align fragments of the query protein and the metal ion-binding residue template. After each comparison, each residue in the query protein was assigned an alignment score that is composed of two functions for the evaluation of sequence and structure conservation. The sequence similarity is calculated by using the BLOSUM62 substitution matrix [26], and the structure similarity is taken by measuring the root mean square deviation (RMSD) of the Ca carbons of the local structural alignments. Residues that score above the assigned alignment-score threshold are predicted to bind metal ions. This method is illustrated in Figure 1.

Dataset containing the metal ion-binding proteins
The proteins in the final dataset were extracted from the PDB and contain at least one Ca 2+ , Cu 2+ , Fe 3+ , Mg 2+ , Mn 2+ , or Zn 2+ ion. At the time of our study, approximately one-fourth of all PDB entries (20094 of 77294 proteins) contained a metal ion(s). The following criteria were applied to these proteins as filters. If the structures did not contain any polypeptide chain, those structures were excluded. For proteins containing more than one polypeptide chains, we included only the chains with residues involved in metal ion-binding. The length of the polypeptide chain was required to be more than 50 residues. DNA and/or RNA components were removed, leaving only the polypeptide chain.
To ensure that many different types of proteins were included in the dataset, proteins were grouped according to their superfamily by SCOP (version 1.67) [27]. Proteins that could not be classified by in this manner were removed. Finally, BLASTClust, in the standalone BLAST package (version 2.2.10) [28], was used to align the sequences in a pairwise fashion so that the remaining proteins could be sorted into groups that had sequence identities $ 25%. This step was performed to remove the redundant structures from the dataset because sequences with at least 25 % identity usually have similar conformations. For each cluster we retained the first entry as representative of the cluster. The final dataset is composed of 1,109 polypeptides representing 361 SCOP superfamilies.   Metal ion-binding residue templates Figure 2 shows an example of a local structure containing metal ion-binding residues, i.e., those at least partially within 3.5 Å of a metal ion center as judged by their PDB coordinates. To be considered as a template, a site needed contain more than two metal ion-binding residues. In total, 1,410 templates were generated from the 1,109 polypeptides. Table 1 list the statistics for each kind of metal ion-binding polypeptide and metal ionbinding template.

The fragment transformation method
In general, the fragment transformation method [25] aligns similar local fragments that contain residues that are discontinuous in sequence but spatially close; for our study, the method was modified to align metal ion-binding residues. The fragment transformation method treats each binding residue as an individual unit. The structural unit used to align the query protein and the templates is a triplet formed by the backbone N{C a {C atoms of a given residue. S denotes the query protein of length m, T denote template of n residues. The N{C a {C triplets of S and T be given by (xN,xCa,xC) and (yN,yCa,yC) respectively, where x and y are the PDB coordinates for that atom. S and T can therefore be expressed in terms of the triplets as Note that the information contained in the peptide bonds preceding and following a residue is not used, meaning that s and t are not representative of the backbone torsion angles, w and Q, which require the coordinates of C 0 {N{C a {C and N{C a {C{N 00 , respectively, where C' is the carbonyl carbon preceding the residue and N'' is the amide nitrogen of the next residue. Thus, the fragment unit do not contain information concerning the torsion angles.
A matrix of dimensions m|n is then constructed for the residues of S and T as: where the element M ij is a rigid-body transformation matrix that Performing triplet clustering D ij kl , defined as the Cartesian distance between the target t l and the transformed triplet M ij s k , provides a measure of how similar the orientation of the triplet pairs (s i ,t j ) and (s k ,t l ) is, which allows us to cluster the triplet fragments using the single-linkage algorithm [29] as follows. If for two triplet pairs,(s i ,t j ) and (s k ,t l ), D ij kl vD 0 , and i=k and j=l, then the triplets are clustered. LetG 1 and G 2 be two clusters, the first containing (s i ,t j ) and (s k ,t l ) and the second containing (s i 0 ,t j 0 ) and (s k 0 ,t l 0 ). If

Scoring function
The metal ion-binding score, C i , for each residue i, is defined as where e m is the number of triplets of S m , i.e., the aligned residues of the query structure. The alignment scores C R m , C B m are defined as: and where RMSD(S m ,T m ) is the root mean square deviation of allC a atoms between S m and T m ;BLOSUM(S m ,T m ) is the sequence alignment score between S m and T m , calculated using the BLOSUM62 [26] substitution matrix, and BLOSUM(T m ,T m ) is the maximum sequence alignment score of T m . The value of RMSD(S m ,T m ) should be less than 3 Å , and C B m should be greater than C B 0 which can be adjusted to obtain the best result for each type of metal ion. Finally, the normalized metal ion-binding score, Z C i , is calculated as: where C and SD C denote the mean and the standard deviation, respectively, of the metal ion-binding score.

Performance assessment
The performance of the metal ion-binding site prediction method, i.e., the prediction accuracy (ACC), was defined as the number of true positive and true negative and evaluated using a leave-one-out approach. The accuracy (ACC), the true positive rate (TPR) and false positive rates (FPR) were calculated using the

Metal ion-binding residue profiles
Spheres each with a 3.5 Å radius from the center of a metal ion were constructed for each metal ion-site in our dataset. We assessed the frequency that each of the 20 amino acids coordinated a metal ion (Fig. 4); those metal ions were found to preferentially bind certain residues, as follows: for Ca 2+ , ASP, GLU, ASN, and GLY; for Cu 2+ , HIS; for Mg 2+ ASP and GLU; for Fe 3+ , HIS, GLU, ASP, CYS, and TYR; for Mn 2+ , ASP, HIS, and GLU; and for Zn 2+ , CYS and HIS. Notably, each type of metal ion favors specific residues.
The preferred types of atoms surrounding the metal ions are as follows ( Figure 5): for Ca 2+ , backbone and side-chain oxygens; for Mg 2+ and Mn 2+ , side-chain oxygens; for Cu 2+ , Fe 3+ , and Zn 2+ , oxygen, nitrogen, and sulfur. Each metal ion appears to preferentially bind certain atoms in certain residues.

Predictive performance
For each metal ion, we set the threshold of the normalized metal ion-binding score so that the FPR was #5% (Fig. 6). For Ca 2+binding proteins, the threshold was 1.6, which gave a 94.1 %  accuracy and a TPR of 48.9 %; for Cu 2+ -and Mg 2+ -binding proteins, the threshold was 1.8, which yielded 94.9 % accuracy and a TPR of 85.6 %, and 95.0 % accuracy and a TPR of 61.4 %, respectively; for Fe 2+ -and Mn 2+ -binding proteins, the threshold was 1.0 for 94.9 % accuracy and a TPR of 85.4 %, and 94.6 % accuracy and a TPR of 37.0 %, respectively. The best performance was obtained for Zn 2+ -binding proteins, for which a threshold of 2.2 gave 94.8 % accuracy and a TPR of 71.1 %. The performance of the predictions as a function of the threshold values for six types of metal ion-binding proteins is shown as receiver operating characteristic plot (TPR values vs. FPR values, Fig. 7). The predictive performance was excellent for Cu 2+ -and Fe 3+ -binding proteins and very good for Mn 2+ -and Zn 2+binding, but less so for Mg 2+ -and Ca 2+ -binding proteins.

Comparison with published methods
We compared our results with those obtained using the artificial neural network (ANN) method [30] and the geometric subgraph method [18]. The same types of metal ion-binding sites were used in the three studies, and the methods were each designed to predict every residue within a metal ion-binding protein as a binding or a non-binding residue. When the FPR was # 5 %, our method was more accurate and had greater TPR values than did the ANN method (Table 2). Given the similar accuracies (61 %), the larger TPR values were especially noticeable for the Cu 2+ -and Fe 3+ -binding proteins (TPR = 85.6 % and 85.4 % for our method, and 36.2 % and 48.8 % for the ANN method, for the two types of proteins, respectively). The TPR values for Mn 2+ and Zn 2+ also dramatically improve-from 38.8 % to 61.4 % for Mn 2+  and from 47.8 % to 71.1 % for Zn 2+ . The TPR for Ca 2+ also increased from 30.4 % to 48.9 %; however, the improvement was much smaller for Mg 2+ , from 32.4 % to 37.0 %. The average TPR for the six classes of proteins for our study was 60.5 % (FPR# 5 %), which is an improvement compared with the results obtained using the geometric subgraph method (TPR, 46.9 %; FPR, 11.9 %).
Template matching Figures 8,9,10,11,12,13 show examples of an alignment for each type of metal ion-binding protein and the corresponding template. The structures were drawn by PyMOL [31]. For human cytosolic phospholipase A2 (PDB ID: 1RLW; Fig. 8) [32], which has two Ca 2+ -binding sites, seven binding residues were found, all with large normalized metal ion-binding scores. The template that best  aligned with the Ca 2+ -binding sites in phospholipase A2 was the chain A of synaptotagmin I C2B-domain (PDB ID:1K5W) [33] (Fig. 8). The template for the Cu 2+ -binding protein, human ceruloplasmin (PDB ID:1KCW) [34], almost perfectly aligned with the Cu 2+ -binding site in the A chain of plastocyanin (PDBID:1BAW) [35] (Fig. 9), although a few FP metal ionbinding residues were also identified. The best predictive performance was found for Fe 3+ -binding proteins. For desulfoferrodoxin (PDB ID:1DFX) [36], two templates derived from two different proteins, superoxide reductase chain A (PDB ID:1DO6) [37] and rubrerythrin chain A(PDBID:1B71) [38], matched its two binding sites, and the nine binding residues, plus an FP, were identified (Fig. 10). Although the identification of Mg 2+ -binding sites was not as successful, because many FPs were associated with high scores, the Mg 2+ -binding site of human mitochondrial deoxyribonucleotidase chain A (PDB ID:1MH9) [39] was found to be similar to the template constructed from transglutaminase 3 chain B (PDB ID:1NUG) [40] (Fig. 11). In cytochrome b1 chain A (PDB ID:1BFR) [41], two Mn 2+ -binding sites were in close proximity and involved the same two glutamic acids (Fig. 12). These binding sites were found using the template from ribonucleotide reductase chain A (PDB ID:1KGP) [42], even though a reorientation of the template was required during the fragment transformation procedure. For Zn 2+ -binding proteins, a near perfect match was found for chain A of the inhibitor of apoptosis protein DIAP1 (PDB ID:1JD5) [43] and the template from chain E of the baculoviral IAP repeat-containing protein 4, BIR 2 (PDB ID:1I3O) [44] (Fig. 13).

Discussion
In this study, we developed and used a structure comparison method to predict metal ion-binding sites in proteins. During development, we combined conserved structure and sequence information to identify metal ion-binding residues, which are extremely important design elements as they substantially affect the prediction. Our prediction method performed much better for Cu 2+ , Fe, 3+ Mn 2+ , and Zn 2+ than it did for Ca 2+ and Mg 2+ , possibly because there are fewer types of residues that bind the transition metal ions compared with those that bind the alkaline earth ions. Thus, the residues and structures of the Ca 2+ -and Mg 2+ -binding sites may be less specific. In particular, we observed that backbone carbonyl oxygens, rather side-chain oxygens, frequently bind Ca 2+ and Mg 2+ , which indicates that the type of residue is less important-at least for an interaction involving a carbonyl oxygen. Conversely, interactions between backbone atoms and Cu 2+ , Fe 3+ , Mn 2+ , and Zn 2+ are rare; instead, sidechain atoms bind these ions; causing steric and chemical limitations imposed by the particular side-chain. These two factors, i.e., residue and atom-binding patterns, probably result in smaller sequence alignment scores for the metal ion-binding residues. As such, the final metal ion-binding scores for certain residues may in fact be lower than the threshold value set for metal ion-binding residues.
Our approach yielded excellent predictions for Cu 2+ -and Fe 3+binding sites, and very good predictions for Zn 2+ -and Mn 2+binding sites. Although the method gave poorer results for Ca 2+and Mg 2+ -binding sites, it nonetheless performed better than did the geometric subgraph and ANN methods. Ultimately, for an FPR threshold of 5 % our method achieved an overall 94.6 % accuracy with a TPR of 60.5 %, which is a substantial improvement over other prediction methods currently available. Therefore, our method may find use as a predictor of putative metal ion-binding proteins and their binding. The Linux binary codes for our method are available upon request.