An Integrative Computational Framework Based on a Two-Step Random Forest Algorithm Improves Prediction of Zinc-Binding Sites in Proteins

Zinc-binding proteins are the most abundant metalloproteins in the Protein Data Bank where the zinc ions usually have catalytic, regulatory or structural roles critical for the function of the protein. Accurate prediction of zinc-binding sites is not only useful for the inference of protein function but also important for the prediction of 3D structure. Here, we present a new integrative framework that combines multiple sequence and structural properties and graph-theoretic network features, followed by an efficient feature selection to improve prediction of zinc-binding sites. We investigate what information can be retrieved from the sequence, structure and network levels that is relevant to zinc-binding site prediction. We perform a two-step feature selection using random forest to remove redundant features and quantify the relative importance of the retrieved features. Benchmarking on a high-quality structural dataset containing 1,103 protein chains and 484 zinc-binding residues, our method achieved >80% recall at a precision of 75% for the zinc-binding residues Cys, His, Glu and Asp on 5-fold cross-validation tests, which is a 10%-28% higher recall at the 75% equal precision compared to SitePredict and zincfinder at residue level using the same dataset. The independent test also indicates that our method has achieved recall of 0.790 and 0.759 at residue and protein levels, respectively, which is a performance better than the other two methods. Moreover, AUC (the Area Under the Curve) and AURPC (the Area Under the Recall-Precision Curve) by our method are also respectively better than those of the other two methods. Our method can not only be applied to large-scale identification of zinc-binding sites when structural information of the target is available, but also give valuable insights into important features arising from different levels that collectively characterize the zinc-binding sites. The scripts and datasets are available at http://protein.cau.edu.cn/zincidentifier/.


Introduction
Metalloproteins are proteins that can bind metal ions in order to fulfill the catalytic or structural requirement. Approximately onequarter to one-third of all proteins require metal ions to carry out their functions [1,2]. Zinc is the most abundant metal ion in the Protein Data Bank (PDB) database [3,4], playing a wide range of functional roles in eukaryotic organisms [5]. It is estimated that about 40% of the zinc-binding proteins in the human proteome are transcription factors; the remaining 60% are primarily enzymes as well as other proteins including those involved in ion transport [6]. Due to the important structural and functional roles, the presence and location of the zinc-binding sites can provide important clues regarding the function of the protein [7]. Therefore, accurate prediction of zinc-binding sites is useful not only for the inference of protein function but also for the prediction of 3D structures of proteins that contain zinc ions, thereby providing an important assistance in facilitating protein functional annotation efforts.
A plethora of computational approaches have been proposed to address the task of predicting metal (e.g. zinc) binding sites. Many research groups attempt to accomplish this by using complex machine learning methods to train prediction models based on the primary sequences of the target proteins. For example, Lin et al. [8] previously employed a neural network approach to predict the metal binding residues based on the biochemical and biophysical features derived from sequence. In another study, Lin et al. [9] developed a support vector machine (SVM)-based method with improved classification performance that used sequence-derived physicochemical properties to predict 10 metal-binding classes and identify all metal binding proteins. Their results showed that several residue properties play more prominent roles than other properties for the prediction of the zinc binding sites, including hydrophobicity, solvent accessibility, polarity and composition. Menchetti et al. [10] trained SVM classifiers to identify all Cys and His (CHs) that coordinate with zinc in the target proteins. More specifically, they described the zinc-binding site using the sequence motif [CH] x(0-7) [CH] (x(0-7) denotes a subsequence with a length from 0 to 7 residues) which underlies the correlation between the nearby residues. Their method achieved 60% precision at 60% recall for predicting zinc-binding Cys and His based on 5-fold cross-validation tests. In a follow-up study by the same group [11], the sequence motif of zinc-binding sites was integrated into two predictors to predict the zinc-binding sites and annotate metal-binding site proteins at the proteome level. Their method achieved an AUC (area under the ROC curve) of 0.890 and an AURPC (area under the Recall-Precision curve) of 0.500 for predicting zinc-binding sites using 5-fold cross-validation tests based on 2,428 protein chains. Shu et al. [12] recently combined SVM-based and homology-based predictions to identify zincbinding sites by focusing on four residue types-Cys, His, Glu and Asp (CHED). Their results showed that CHs could be more readily predicted compared to EDs. Lippi et al. [13] developed a web server to predict the metal-binding sites and disulfide bridges from sequence. Andreini et al. [14] reported a sequence-based method for prediction of metalloproteome. More recently, Passerini et al. proposed a new graph-based algorithm to predict the transition-metal-binding sites coordinated by Cys and His from sequence alone [15].
The increasing availability of high-quality structures deposited in PDB, makes it possible to fully explore the important structural information to accurately predict metal-binding sites, especially zinc-binding sites. In this regard, Sodhi et al. [16] developed a MetSite method using a variety of structural features to predict the metal-binding residues in low-resolution structural models. The Fold-X program, originally developed by Schymkowitz et al., used empirical force field to predict the metal binding sites and binding affinities for the metal ions by analyzing the protein structures [17]. Goyal and Mande [18] reported a method focusing on the geometrical constraints for the prediction of zinc binding sites. The CHED method can be used to predict transition metal-binding sites from apo protein structures [4]. The FEATURE method [19,20] applies a Bayesian classifier to recognize zinc-binding sites using a variety of physical and chemical properties that represent the microenvironment in six concentric spherical shells around the central points of zinc. Bordner developed a computational method called SitePredict that trained a Random Forest (RF) classifier using the backbone structure information to predict the metal ions and small molecules based on diverse residue properties, including spatial clustering of residue type, median-relative solvent accessible surface area and evolutionary conservation [7]. More recently, a new structure-based approach TEMSP was reported, which uses only the main chain structure of the target protein for the prediction of zinc-binding proteins [21].
Although there has been some progress for developing methods for predicting zinc-binding sites, the precision and recall of most available methods are relative low. With the rapidly increasing number of high-quality structures generated by structural genomics projects and deposited in PDB, it remains an important and pressing task to develop more accurate prediction methods that can be able to reliably identify zinc-binding or any other types of metal-binding sites, which will aid in and complement to functional annotation efforts. In particular, it is possible to develop more accurate predictors of zinc-binding sites by taking advantage of powerful machine learning algorithms, informative feature extraction and selection methods based on high-quality structural benchmark datasets. In this study, we propose an improved, generic framework based on the RF algorithm for identifying zincbinding sites in proteins by focusing on four types of residues Cys, His, Glu and Asp (CHED) of the target proteins. Our approach takes zinc-binding residues that play catalytic or structural roles as positive samples, and the remaining as negative samples. It integrates four major types of the most important features selected by a two-step feature selection procedure, including sequencebased, structure-based, graph-theoretic network and other features into the RF classifier. We investigate to what extent each selected feature is relevant to zinc-binding site prediction by removing the feature from the feature set and examining the performance of the resulting models. We further show that this novel computational framework achieves an improved overall performance on a highquality benchmark dataset and is capable of more accurately predicting zinc-binding residues compared to two other state-ofthe-art predictors zincfinder and SitePredict.

Overview of our methodology
An overview of our methodology is depicted in Fig. 1. As can be seen, there are four major stages: dataset construction, feature extraction, feature selection and zinc-binding prediction. At the first stage of dataset construction, all the four major types of zincbinding residues, i.e. CHEDs are selected from the target proteins in the curated benchmark dataset. In the second stage, these CHEDs are further encoded into feature vectors which are represented based on a local window of residues centered at each selected CHED by making use of four generic categories of features: sequence, 3D structure, network and other features. In the third stage, we perform extensive feature selection to characterize the most important and contributive features that are relevant to the prediction by recursively training RF classifiers and examining their prediction accuracy. The classifiers are trained based on diverse residue-and protein-based features. In the final stage, we optimize the RF-based classifiers using the selected important features and apply the trained classifiers to predict zinc-binding residues and compare the performance of our method with the other two methods.

Dataset
The benchmark dataset used in this study was derived from a protein dataset that was previously prepared by Passerini et al. [22]. This dataset contains 2,727 protein chains with zero HSSP, i.e. no pair of proteins had HSSP value.0 by running UniqueProt [23]. The HSSP value measures the pairwise sequence similarity between the two sequences by taking into account both sequence identity and sequence length [23]. We extracted proteins with a resolution better than 2.5 Å , which resulted in a final reduced dataset of 1,103 chains.
We finally selected a total of 49,631 CHEDs from the 1,103 chains. Among them, 144 chains bind to at least one zinc ion (See Table 1 for statistics), 79% of which are Zn3 or Zn4 binding chains (defined as ''Znx'', where ''x'' denotes the number of amino acid residues bound to zinc ions). The total number of residues binding to zinc is 484. Among these residues, more than 86% of the selected Cys, His, Glu and Asp residues bind to Zn3 or Zn4.
Residues are considered as binding to zinc if the distance between the atoms of the residue and zinc [24] is within a threshold of 3 Å . There are totally 223 zinc atoms binding to the selected 144 chains, as shown in Table 1.
Zinc ions of zinc metalloproteins are significant for their catalytic and structural functions. Zinc ions that bind to four residues (defined as ''Zn4'') are considered as structural zinc, which maintain the stability of the protein but are not involved in any biochemical reaction. Zinc ions bind to three residues (defined as ''Zn3'') are generally catalytic zinc, which bind to molecules and may be involved in a reaction. Except for Zn3 and Zn4, other zinc ions require cooperation with other metal ions in order to fulfill the function, usually linking with side chain atoms or water molecules. These zinc ions are called co-catalytic zinc (Co-catalytic Zn) [12]. Apart from the above zinc types, some zinc ions only bind to one or two residues; they are located on the surface of the zinc metalloproteins and lack the biological function. In this study, we only focused on the functional zinc ions (i.e. Zn3, Zn4 and Cocatalytic Zn which have catalytic or structural roles [25,26]). In summary, CHEDs binding to these zinc ions are positive samples  We also generated an independent test set. We randomly divided the 484 positive samples into six parts, one of which was used as the independent test set while the remaining five parts were merged together and used as the benchmark dataset which was used for feature selection and 5-fold cross-validation test purposes. The negative samples were randomly selected with the negative to positive ratio of 6:1 for both benchmark training and independent test datasets. Note that the independent test dataset was never used during the model training and feature selection process; the prediction model was trained based on the benchmark dataset and then applied to the independent test dataset for its performance evaluation.
Moreover, an apo protein structure dataset was produced according to the holo structures included in the current datasets, using the same strategy as described by Ebert and Altman [4]. More specifically, we first searched the PDB database by detecting the apo protein structures that do not contain zinc and share 95% or more sequence identity with the holo structures. We then identified the residues in the apo structures that correspond to those binding zinc in the holo form. Finally, we obtained a set of 31 apo/holo structures containing 96 zinc-binding residues. For evaluating the prediction performance on the apo protein structure set, the ratio of negative to positive was also set as 6:1. The curated benchmark dataset, independent test dataset and apo protein dataset are available at http://protein.cau.edu.cn/ zincidentifier/.

Feature extraction
We select and extract a variety of features that describe different sequence, structure and network microenvironments which might be of importance for zinc-binding site prediction and which are used in combination as input to the RF-based classifiers. Features used in our framework comprise of four generic types: sequence, structure, network and other features (listed in Table 2). In particular, although solvent exposure features are generally considered as structural properties, this particular feature type is important for improving the prediction performance of zincbinding residues. Also, we find several solvent exposure features useful for the prediction which were mostly understudied in previous work, which is discussed in the following sections. Therefore, we consider the solvent exposure feature as an additional feature type and would like to investigate its ability to improve upon the prediction versus other structure features.
1. Sequence features. A wide range of sequence features are used which make a significant improvement for the prediction. These include: (i) position-specific scoring matrices (PSSMs) from PSI-BLAST [27] using a default E-value cutoff with three iterations; (ii) native disorder predicted by DISOPRED2 [28]; (iii) protein aggregation properties generated by TANGO [29]; (iv) sequence length; (v) CHED residue percentage in the protein chain; (vi) conservation score directly derived from the PSSM. The conservation score is defined as: where p i,j is the frequency of amino acid j at position i. It was extracted from the PSSM generated by PSI-BLAST.
1. A combination of these sequence or sequence-derived features has been shown to be able to further improve the prediction performance in our recent work [30][31][32][33]. 2. Structure features. DSSP [34] is used to extract a series of structure features, including the secondary structure, solventaccessible surface area (SASA), hydrogen bonds and protein backbone torsion angles (PHI and PSI angles). HBPLUS [35] is used to calculate the number of hydrogen bonds. NACCESS is also used to generate the respective solvent accessibility of the side chain, main chain, non-polar and polar region [36]. 3. Network features. Graph-theoretic approach derived from the residue-residue contact graphs is becoming a useful method for unraveling the modular network organization of protein structure and predicting the protein function given the structure information [37][38][39]. If the spatial coordinate distance of the corresponding Ca atoms of two residues is within a cutoff distance threshold, the two residues are defined as being in contact. Residue-residue contact graphs were constructed using the spatial coordinates of protein structures, with a cutoff distance of 6.5 Å between the Ca atoms of two residues in the structure. We use the JUNG library (available at http://jung.sourceforge.net/) to calculate the network properties that represent the local microenvironment of residues, including clustering coefficient, degree, density, eccentricity, betweenness, closeness and status. 4. Other features. New solvent exposure features are also used in our study. They provide an important description of the local spatial environments of the residue of interest, providing useful information that is complementary to the traditional solvent exposure measures, such as solvent accessibility and accessible surface area [40][41][42]. We find that these features are particularly important for the prediction and thus discuss their relevance by considering them as a separate feature type from other structure features. These include the number of C b atoms in the lower Half-Sphere (HSEBD), residue depth (RD), the coordination number (CN), the number of C a atoms in the upper Half-Sphere (HSEAU), the number of C b atoms in the upper Half-Sphere (HSEBU), the number of C a atoms in the lower Half-Sphere (HSEAD) and atom depth of a residue's C a atom (RDa). The hsexpo program in the Biopython package is used to calculate these solvent exposure features [43].
We use a sliding window approach with a size of 9 residues to extract all sorts of different features and take as input to train the RF models. Each residue is represented by Vn (n = 1, 2, 3, …, 9, denoting the position of the residue in such window; the centered residue is denoted as V5). Similarly, the elements in the PSSM (with a dimension of 9620 = 180) are denoted as V1, V2, …, V180, respectively.

Feature selection
Due to the nature of having a large number of heterogeneous and possibly redundant features in many data mining tasks, feature selection techniques play an increasingly important role in bioinformatics applications [44]. Thus, a great deal of attention is being paid to application of these techniques that can potentially improve the prediction accuracy of machine learning classifiers, for example, in the prediction of protein folds [45], catalytic residues [46], protein crystallization [47], helix-helix interaction in membrane proteins [48] and disulfide connectivity [49]. In our study, we used a two-step feature selection method based on RF [50] to identify most informative and contributive features for the prediction of the zinc-binding sites. In the first step, the mean decrease Gini index (MDGI) was used to generate the optimal feature candidates (OFCs), which was calculated by the RF package implemented in R [51]. MDGI describes the importance of an individual vector element for correctly classifying a residue as zinc-binding or non-zinc binding. The mean MDGI Z-Score of each vector element is calculated by the following equation: x x is the mean of all MDGI values of all vector elements of the feature x, and s is the standard deviation (SD). We obtained the optimal feature candidates (OFCs) by selecting the features from the top 70 MDGI Z-Scores features.
The second step is a stepwise feature selection based on the RF classifier. To evaluate the prediction performance of the classifiers, 5-fold cross-validation tests were performed, where the whole benchmark dataset was randomly split into five subsets. In each validation step, one subset was left out as the testing set, while the remaining four subsets formed the reduced training set. The RF model was trained on this reduced training set and subsequently validated by testing it on the left out subset. This procedure was repeated five times, each time leaving out a different subset of the samples such that in the end each subset was in turn left out exactly once. The final performance was calculated by averaging over the five subsets. In the first round of the stepwise feature selection, we trained the RF model with the 70 optimal selected feature candidates. Then in the next round, one feature would be removed from the set of the optimal feature candidates. If the resulting RF classifier achieved a higher score (score = AUC*recall*precision, which can be considered as a balanced score incorporating the three measures AUC, recall and precision), this feature would be removed. In particular, this score was averaged over 100 times to obtain a stable value, hence the order in which features were removed in each round was almost certain. This combined score was used to evaluate the performance of the RF classifier because it is useful for comprehensively monitoring the changes of all of the three measures. The stepwise feature selection procedure continued until this score no longer increased. This procedure allows the most important and informative features to be systematically identified.

Random forest-based classifier
In this study, we formulate the prediction task of zinc-binding sites as a binary classification problem and solve it using a machine learning approach-random forest (RF), where the zinc-binding residue is labeled with 1 and non-zinc binding residue is labeled with 21. RF is an ensemble tree-structured classifier [51]. A typical RF extends many classification trees; each tree generates a classification, and the tree ''votes'' for one of the two classes (positive or negative). To classify and predict whether a residue is zinc-binding or not, the physicochemical properties and the microenvironment of the selected residue are represented by a feature vector and encoded into the RF-based predictor to classify this vector as either positive (zinc-binding) or negative (non-zinc binding). The random forest classifier selects the classification as the final prediction that has the the largest number of votes (among all the trees in the forest). An advantage of RF is that it is particularly suitable for dealing with a large dataset with highdimensional and noisy input features. In addition, as RF does not require time-consuming optimization process, the classifier training and prediction by RF are generally much faster than many other algorithms such as SVM. Due to these advantages, RF has been extensively applied in many classification and regression tasks and demonstrated satisfactory performance, for example, in predicting residue-residue contacts [52], helix-helix interaction in membrane proteins [48], DNA-binding residues [53] and protein interaction sites [54]. We use the randomForest R package [51] for the implementation of the RF algorithm.

Performance Evaluation
We use four performance measures, namely, Recall (REC), Precision (PRE), the Area Under the Curve (AUC) and the Area Under the Recall-Precision Curve (AURPC) to evaluate the prediction performance of the methods. In our study, the dataset is heavily imbalanced with the negative to positive ratio of 100:1. For this highly imbalanced dataset, use of the Accuracy measure, i.e. the proportion of true prediction results (true positives+true negatives) in the dataset, is not appropriate, since all-positive or all-negative classifiers can also achieve a very good classification rate. Therefore, we consider using alternative metrics such as AUC and AURPC for a comprehensive evaluation of the models' performance. AUC is the area under the receiver-operating characteristic (ROC) curve, which is a plot of true positive rate (TPR) against false positive rate (FPR). AURPC is also used in our work to evaluate the performance, which is regarded as a good alternative to AUC if there is a large skew in the class distribution [55].
The Precision is defined as:

PRE~T R TPzFP
The Recall (TPR) is defined as:

REC~T P TPzFN
The Specificity is defined as:

SPE~T N TNzFP
The FPR is defined as:

FPR~F P FPzTN
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives, respectively.

Results and Discussion
Two-step feature selection by random forest By means of reducing the high dimensionality of feature vectors, feature selection can be effectively used to select the most informative features and generate a more succinct feature subset from the initial feature set with a large number of noisy and redundant features, which can allow us to significantly improve the prediction performance of machine learning-based classifiers [44]. To give insights into the feature characteristics and select the most contributive features for the prediction, we performed a two-step feature selection. The first step was to select the optimal feature candidates, while the second step was a stepwise feature selection to generate the final list of optimal features which would be used to train the best model.
In the first step of feature selection, we used the MDGI score to generate the initial OFCs based on a RF classifier. As a result, the top 70 features ranked were selected as the initial OFCs (Fig. 2). As can be seen, PSSM had the highest mean MDGI Z-Score, solvent accessibility feature calculated by NACCESS and solvent exposure feature also had larger Z-Scores. Meanwhile, some important features such as CHED_residue_percentage, chain length and network_closen_cent were ranked with a lower score. The percentage of each of the four types of features, as shown in the pie chart in Fig. 2, indicates that most of the contributive features in our method were predominantly derived from sequence information.
At the second step, a stepwise feature selection method was used (see the ''Feature Selection'' section). If the removal of a feature led to a higher prediction score of AUC*Recall*Precision, such feature would be deleted from the feature set. By iteratively removing the redundant and less informative features from the initial optimal feature candidate set, it is expected that the prediction performance could be gradually improved during this process. This two-step feature selection that combined random forest MDGI Z-Score ranking and a stepwise selection, provides a realistic approach for selecting an optimal subset of more important features within a reasonable computational time compared with other intensive feature selections [56]. In this work, this feature selection procedure assessed a total of 878 features in 9 h and 70 features in 290 h (56 rounds), respectively, on a 2*4 Xeon Intel X5650 processor Linux server (Ubuntu 10.10). Consequently, we obtained a more compact and useful feature subset that improved prediction of the RF classifiers with the application of this two-step feature selection method.

Feature importance and contribution
Finally, 14 features were selected by our two-step feature selection method. The nomenclature of the selected features is specified as follows: PSSM_5_H and PSSM_5_C denote the PSSM feature in the respective positions of 9 (the ''H'' column in the PSSM) and 5 (the ''C'' column in the PSSM) of the centered residue (i.e. 5th position in the local window); Conservation_-score_V5 describes the conservation score of the centered residue (Conservation feature corresponds to a 9-dimensional vector with 9 residues in the local window); All_polar_abs_V5 describes the absolute solvent accessibility of the all-polar side chain of the centered residue, which was calculated by NACCESS; ChainCH-ED_H_(P) denotes the percentage of His (H) in the CHEDs of the protein chain; ChainCHED_D_(P) and ChainCHED_E_(P) have similar meanings as ChainCHED_H_(P); Ex_CN_V5 and Hbplus_V5 belong to the new solvent exposure feature type and describe the coordination number (CN) and the number of Hbonds of the centered residue, respectively; Chain_E_sum and Chain_CHED_sum denote the numbers of Glu (E) and CHED in the chain, respectively; chain_length represents the number of all residues in the chain; Net_closen_cent_V9 denotes the closeness centrality of the 9th residue in the local window. In particular, inclusion of Net_closen_cent_V9 in the final feature set means that the network closeness property of neighboring residues of the centered residue in the residue-contact graph also plays an important role for prediction of zinc binding sites. The boxplots for the 14 final selected features are given in Fig. 3. It can be seen that while some feature values of the positives contrast sharply to their negative counterparts including All_polar_abs_V5, Chain_CHED_sum, Chain_E_sum, chain_length, Ex_CN_V5, other feature values did not show any obvious distinction between the positives and negatives (including chain_residue_H(P), Hbplus_V5, Net_closen_cent_V9, PSSM_5_C and PSSM_5_H), yet they collectively made a significant contribution to the performance of the final best RF model.
It is interesting to note that almost a majority of selected features were derived from sequence information rather than structure. Among the final 14 features selected by our two-step feature selection method, indeed 10 of them were derived from sequence information. In contrast, there were only two selected structure features. The possible reasons might be: (1) It has been well established that certain sequence features are important for predicting zinc-binding sites, such as the evolutionary information contained in the PSSM matrix and conservation score, which are well known features making an important contribution to the prediction performance; (2) Most initial features used in our study are derived from sequence information. For example, 70 features were selected and retained after the first step feature selection according to Z-Score, 47% of which are sequence features and most of which have relatively high Z-Scores (see Fig. 2). Therefore, it is expected that a large portion of the selected features after the second step feature selection would be derived from sequences; (3) The selected 10 sequence features have a good complementarity with other selected feature types, such as the structure and network features, thereby making a collective contribution to improved prediction of zinc-binding sites.
We further went on to elucidate the importance and contribution of the 14 final selected features at both residue and protein levels. We examined the influence of each of the 14 features on the prediction performance by iteratively removing each from the final feature set and consequently calculating the corresponding REC, PRE, AUC and AURPC of the resultant RF classifiers. The results are presented in Table 3. PSSMs and conservation score have been found useful for the prediction in previous studies [12]. In this study, they also made a great contribution to the prediction performance, especially the PSSM_5_C feature. The importance of this feature is reflected by the fact that there would be a large decrease in 7 out of 8 performance measures when we used the rest of 13 features to train the RF model. In the absence of the PSSM_5_C feature, the performance of the trained models considerably deteriorated, as reflected by the REC, AUC and AURPC at both residue and protein levels (at residue level: REC decreased from 0.804 to 0.760, AUC from 0.960 to 0.954 and AURPC from 0.832 to 0.815; at protein level: REC decreased from 0.748 to 0.703, AUC from 0.966 to 0.960, AURPC from 0.825 to 0.808). These correspond to the first and second largest performance decrease, which indicates the importance of the inclusion of this feature for zinc-binding site prediction of our method.
Moreover, the percentage of Glu (E) and Asp (D) in CHEDs of the protein chain also made a considerable contribution to the performance improvement, which is reflected by the changes of AURPC at both residue and protein levels by comparing the resulting performances in the presence and absence of the given features. About 96% of zinc binding residues were CHEDs in our dataset. Compared to Glu and Asp, Cys and His had higher percentages of residues that bind zinc. The percentage of Glu and Asp were only 16.5% and 11.2% in the positive CHEDs, whereas in the negative CHEDs, 41.2% and 35.7% were Glu and Asp, respectively. In other words, a protein chain is more likely to bind zinc ions if it has a lower percentage of EDs and a higher percentage of CHs. Accordingly, it is expected that the percentage of Glu and Asp in CHEDs of the chain is a useful feature for zincbinding site prediction.
As can be seen from Table 3, there are two structure features selected in the final feature set. They are All_polar_abs_V5 (calculated by NACCESS) and Hbplus_V5 (calculated by HBPLUS), which define the absolute solvent accessibility of the all-polar side chain and the number of hydrogen bonds of the centered residue, respectively. After removal of the two features from the feature set, PRE of the resulting RF classifiers decreased from 0.750 to 0.714 in the case of Hbplus_V5 and to 0.715 in the case of All_polar_abs_V5, respectively. These correspond to the first and second largest decrease of PRE, demonstrating the importance of these two features for the prediction. In addition, there are two other features Ex_CN_V5 and Net_Closen_CentV9 in the final feature set. The former is a descriptor of solvent exposure of the centered residue in terms of the residue coordination number (CN), while the latter is a network feature which describes the mean geodesic distance from the centered residue to the neighboring node within the residue contact graph of the protein. We can see from Table 3 that Ex_CN_V5 makes a good contribution to AURPC and PRE at both residue and protein levels. However, there is not obvious decrease after the removal of the selected network feature Net_closen_cent_V9. It is well known that closeness centrality indicates a degree of residue centrality in the protein 3D structure. This feature has been previously found useful for predicting functionally important residues, including enzyme catalytic sites which are generally centrally placed in the structure [57]. However, in this study, we find that closeness centrality is not an informative feature for predicting zinc-binding sites. As zinc ions in the zinc-binding proteins are often located at the side of the structure and play structural roles (for example, zinc in zinc finger proteins helps stabilize and coordinate the alpha-helix and two beta-strands necessary for the stabilization of the structure and interaction with DNA, RNA and other molecules), the closeness centrality is not an effective feature for predicting zinc-binding sites. Yet, this feature may be useful for balancing the performance and contribute to the prediction when used in combination with other important features, which is the case when classifying the positives at other thresholds (data not shown).

Comparison with other methods
We further compared our method with other previously developed methods. The performance of each of the compared methods was summarized using the AURPC, AUC, REC and PRE measures at both residue and protein levels (A true positive at the residue level was defined as a zinc-binding residue that was correctly predicted, while a true positive at the protein level was defined as a zinc-binding protein for which at least one zincbinding residue was correctly predicted). In particular, for the sake of the comparison, we compared different methods at the fixed PRE. In addition we plotted the ROC and Recall-Precision curves to show the performance of each method (Fig. 4). ROC curves were plotted using the ''ROCR'' package [58] at the two levels. Recall-Precision curve is a measure of the overall quality of the prediction and is often used as an alternative complement to ROC curves especially when analyzing datasets with a large imbalance in the class distribution [59] by incorporating both the Recall and Precision measures. The AURPC was calculated using the method of Davis and Goadrich [55]. Although most zinc-binding site prediction studies predominantly evaluate the performance using AURPC, we employed both AUC and AURPC to comprehensively evaluate the performance of different methods in this study.
There exist several methods that have demonstrated good performance for predicting zinc-binding residues [7,11,12]. However, due to the unavailability of webserver or standalone software, some methods could not be directly compared to our method. Moreover certain methods can only be applied to predict zinc-binding proteins [21] and are not capable of predicting zincbinding residues, while others predict the spatial position of zincbinding or metal-binding sites [19] rather than zinc-binding residues. In addition, as our benchmark dataset was derived from the previous dataset on which Shu et al.'s method was built, we cannot compare with that method either. Therefore, we compared the predictive performance of our method with another two stateof-the-art methods that focused on prediction of zinc-binding residues, namely, SitePredict [7] and zincfinder [11]. The prediction performance of the three methods is shown in Table 4.
SitePredict is a structure-based method that also uses the RF algorithm to predict zinc-binding sites [7]. The RF classifiers of SitePredict were trained on diverse residue-based site properties comprising of spatial clustering of residue types and evolutionary conservation. We tested the performance of SitePredict by submitting our 5-fold cross-validation (benchmark) dataset to its online webserver. At the residue level, SitePredict generated 252 TP, 84 FP and 151 FN predictions, which corresponds to a REC of 0.625 at the fixed PRE of 0.750. In contrast, our method produced 324, 108 and 79 for TP, FP and FN predictions in the 5fold cross-validation tests, corresponding to a REC of 0.804 and a PRE of 0.750, respectively. Although our method predicted more FPs than SitePredict, the REC of our method is significantly better than that of SitePredict (0.804 versus 0.625), which means that our method is more accurate in predicting the positive samples (zincbinding residues) than SitePredict. Meanwhile, AUC was also improved from 0.923 to 0.960, implying that our method outperformed SitePredict. We also compared our method with SitePredict using AURPC. From Fig. 4A, we can see that the Recall-Precision curve of our method is consistently higher than that of SitePredict, with AURPC increased from 0.738 to 0.832. At the protein level, we also observe a similar tendency that the performance of our method is better than that of SitePredict, with REC, AUC and AURPC increased from 0.649, 0.923 and 0.726 to 0.748, 0.966 and 0.825, respectively, suggesting that our method provides a better classification performance for recognizing zinc-binding sites than SitePredict. Zincfinder is a sequence-based tool for zinc-binding site prediction using SVMs [11]. Zincfinder was applied to our benchmark set and generated 288 TP, 96 FP and 108 FN predictions, respectively at the residue level, with a REC of 0.728 at the equal PRE of 0.750. Compared to zincfinder, REC of our method was increased to 0.804. Zincfinder achieved AUC and AURPC which were respectively worse than those of our method at the residue level, but it attained a higher AURPC at the protein level (0.825 versus 0.841, Table 4). While our method was outperformed by zincfinder at the protein level (reflected by the REC and AURPC), it achieved a predictive power comparable to, if not better than, zincfinder with lower REC and AURPC at the protein level, but with higher REC and AURPC at the residue level as well as higher AUC at both levels. We also compared our method with these two methods on the independent test set. The performance is shown in Table 5. We can see that the REC, AURPC and AUC of our method are respectively better than those of the other two methods at both residue and protein levels. In addition we plotted the Recall-Precision and ROC curves at the two levels in Fig. 5. The Recall-Precision and ROC curves have a similar tendency as those on the benchmark set at the two levels. Overall, these results suggest that our method has outperformed SitePredict and zincfinder on this independent test set.
Considering that zincfinder used only sequence features and our method also had a majority of selected features derived from sequence information, we also trained our model using the 10 selected sequence features only, applied it to make the prediction, and compared the results with zincfinder (See Table 4 and Table 5, respectively). It can be seen that our method is able to provide a competitive and in some cases better performance than zincfinder in terms of Recall, AUC and AURPC at both residue and protein levels based on the benchmark and independent datasets. This suggests that our method that only used the sequence features has also outperformed zincfinder. Overall, the results indicate that zincfinder has a better ability to predict more TPs at the protein level while our method is more accurate and displays a greater predictive power at the residue level. In addition, it is noteworthy that there is an overlap of more than 60% of protein chains in our training set and the benchmark dataset of zincfinder, which may explain why the sequence-based tool zincfinder was better than the other structure-based tool SitePredict and why our method was outperformed by zincfinder at the protein level. Altogether, our method provides a competitive performance compared to the other two state-of-the-art methods on the 5-fold cross-validation benchmark dataset.

Predictions for the apo protein structures
Next, we examined how well our method performs on the apo protein structures, the prediction of which is more relevant for the actual use of our approach. We thus evaluated the performance of our approach based on the generated apo protein structure dataset (See Dataset section). The performance is shown in Table 5. As can be seen, our method achieved the Recall of 0.719 and 0.742 at a corresponding Precision of 0.750 at the respective residue and protein levels on this apo structure dataset. The AUC is 0.953 and 0.970, AURPC is 0.713 and 0.762 at the two levels. Despite the limited size of the apo structure dataset, these results indicates that our method is less sensitive to the structural rearrangements due to the binding of zinc ion, and that our method can be used to identify the zinc-binding sites in the apo protein structures.
Predictions for the four types of zinc-binding residues As discussed above, there are four major types of residues that bind to zinc-Cys, His, Glu and Asp (CHED). Table 6 shows a comparison of the predictive performances of the three different tools for predicting these types on the training set. Cys and His were predicted with higher accuracy than Glu and Asp. For example, the REC values of Cys and His are 0.909 and 0.891 at an equal PRE of 0.750, respectively, which are much higher than those of Glu and Asp, which only had REC values of 0.319 and 0.656, respectively. A similar tendency was observed for AURPC. On the other hand, we note that AUCs of Glu and Asp are higher than those of Cys and His. This is mainly because Cys and His have a higher percentage in positive samples (zinc-binding residues) while Glu and Asp have a higher percentage in negative samples (non-binding residues). The proportions of zinc binding residues of CHED are C: 26.9%, H: 45.5%, E: 16.5% and D: 11.2%. In contrast, the proportions of all CHED samples are C: 10.9%, H: 19.7%, E: 36.1% and D: 33.3%. Hence, the larger percentage of Glu and Asp in negative examples resulted in much more FP predictions than those of Cys and His, leading to a lower PRE. It is also why AURPC was decreased (Fig. 6A) with the decrease of REC. REC indicates the prediction performance of positive samples, which does not necessarily reflect the predictive performance of negative samples. From this perspective, AUC can be considered as a balanced measure of the overall quality of the prediction as it incorporates both SEN (the same as REC) and SPE (which is TN/(TN+FP)) measures. A higher ROC curve of Glu and Asp indicates a better accuracy of negative samples (Fig. 6B). Therefore, AUC is also an important performance measure of our method, playing a crucial rule that is complementary to AURPC. For the sake of comparison, we also list the predictive performance of CHED by another two methods SitePredict and zincfinder in Table 6. We can see that there are straight lines in some of the precision-recall and ROC curves for zincfinder, for example, the ROC curves of ASP, this is mainly because that for 83.3% of negative samples of Asp (D) residues, the predicted value of zincfinder is 0. Similarly, 89.7% of negative samples of Glu (E) residues were predicted as 0, thus there were no data for zincfinder with FPR.0.103 and as a result there was a straight line in its ROC curve.

Conclusion
We have developed a powerful computational framework that combines a variety of sequence, structure, graph-theoretic network and other features of residues to improve prediction of zincbinding sites, which relies on a two-step RF algorithm to retrieve and select the most useful and contributive features to train the prediction models. Using this integrative framework, we investigated what information can be retrieved from the different types of residue microenvironments that is relevant to the prediction of zinc-binding sites. By focusing on four major types of residues: Cys, His, Glu and Asp, it achieved 80% recall at 75% precision in 5-fold cross-validation tests on the benchmark dataset containing 1,103 high-quality structures and 484 zinc-binding residues. We found that Cys and His were generally better predicted with Table 6. Predictive performance of the four types of zincbinding residues (CHED) of our method, SitePredict and zincfinder on the 5-fold cross-validation benchmark dataset. respective recalls of 91% and 89%, which were much higher than those of Glu and Asp. We also performed independent test and the results showed that our method provided a higher REC, AUC and AURPC than SitePredict and zincfinder at both residue and protein levels. Benchmarking experiments showed that our method has outperformed the two other state-of-the-art methods at the corresponding 75% precision on both benchmark and independent test datasets. In summary, the performance improvement of our method may be attributed to three important ingredients: (i) construction of a high-quality non-redundant structural benchmark dataset; (ii) integration of a variety of features including sequence, structure, graph-theoretic network and other features and inclusion of novel features that collectively make a good contribution to the performance; (iii) a two-step feature selection method to remove the overlapping and redundant features. In addition, this framework is generally applicable to many other classification problems in structural bioinformatics and can be readily extended to solve the prediction task of other types of metal-binding or ligand-binding sites. We believe that it can be an effective tool for accurately identifying zinc-binding sites with the increasing availability of high-quality structure data.