Sequence Based Prediction of Antioxidant Proteins Using a Classifier Selection Strategy

Antioxidant proteins perform significant functions in maintaining oxidation/antioxidation balance and have therapeutic potential for some diseases. Accurate identification of antioxidant proteins could contribute to revealing physiological processes of oxidation/antioxidation balance and developing novel antioxidation-based drugs. In this study, an ensemble method is presented to predict antioxidant proteins with hybrid features, incorporating SSI (Secondary Structure Information), PSSM (Position Specific Scoring Matrix), RSA (Relative Solvent Accessibility), and CTD (Composition, Transition, Distribution). The prediction results of the ensemble predictor are determined by averaging the prediction results of multiple base classifiers. Based on a classifier selection strategy, we obtain an optimal ensemble classifier composed of RF (Random Forest), SMO (Sequential Minimal Optimization), NNA (Nearest Neighbor Algorithm), and J48 with an accuracy of 0.925. A Relief combined with IFS (Incremental Feature Selection) method is adopted to obtain optimal features from the hybrid features. With the optimal features, the ensemble method achieves improved performance with a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, and an MCC (Matthews Correlation Coefficient) of 0.880, far better than the existing method. To evaluate the prediction performance objectively, the proposed method is compared with existing methods on the same independent testing dataset. Encouragingly, our method performs better than previous studies. In addition, our method achieves more balanced performance with a sensitivity of 0.878 and a specificity of 0.860. These results suggest that the proposed ensemble method can be a potential candidate for antioxidant protein prediction. For public access, we develop a user-friendly web server for antioxidant protein identification that is freely accessible at http://antioxidant.weka.cc.


Introduction
ROS (Reactive Oxygen Species) are generated in aerobic metabolic processes as a result of endogenous and exogenous factors, such as air pollutants and cigarette smoke [1]. Moderate concentrations of ROS can function in physiological oxidative processes of cells [2,3], [...] predictor [23,24]. (3) Both previous methods developed a predictor based on an individual classifier. An individual classifier usually has its own inherent defects, which can result in poor prediction performance [25]. An ensemble classifier integrates the diverse learning strategies of multiple individual classifiers and can perform better than its component individual classifiers in protein attribute prediction [26].
To address the above-mentioned limitations and improve prediction performance with respect to antioxidant proteins, we propose an ensemble predictor using a classifier selection strategy with hybrid features, including SSI (Secondary Structure Information), PSSM (Position Specific Scoring Matrix), RSA (Relative Solvent Accessibility), and CTD (Composition, Transition, Distribution). The Relief combined with IFS (Incremental Feature Selection) method is used to select highly discriminative features, reducing the computational complexity and improving prediction capability. The prediction results of the ensemble predictor are determined by averaging the prediction results of multiple base classifiers. The computational framework of the proposed predictor is illustrated in Fig 1. To evaluate the performance of our ensemble predictor objectively, the present model is compared with the methods in [21,22] on the same independent testing dataset.

Data Collection
To facilitate comparisons with previous studies in identifying antioxidant proteins, we use the benchmark dataset constructed in [22]. Only those protein sequences from the UniProtKB/Swiss-Prot database [27] that are reviewed and annotated with the term antioxidant in the molecular function category of Gene Ontology are selected. To obtain a reliable dataset, the following criteria are further applied. (1) Sequences that are fragments of other proteins are excluded because their information is redundant and incomplete. (2) Sequences containing letters other than the 20 standard amino acid codes are removed because their meanings are ambiguous.
After the above screening procedures, 482 antioxidant protein sequences are obtained as the original positive dataset. Because the number of non-antioxidant protein sequences is extremely large, 500 non-antioxidant protein sequences are randomly selected as the original negative dataset. To avoid overfitting, the CD-HIT program [28] is used so that none of the sequences has ≥70% sequence identity to any other in the original dataset. The final benchmark dataset consists of 174 antioxidant proteins and 492 non-antioxidant proteins. To validate the performance of our proposed predictor objectively, 100 antioxidant and 100 non-antioxidant proteins are selected from the final benchmark dataset as the training dataset, and the remaining 74 antioxidant and 392 non-antioxidant proteins form the independent testing dataset. The samples in the independent testing dataset are not in the training dataset. The benchmark dataset is available in S1 Table.

Feature Extraction
To develop high throughput tools for predicting complicated protein attributes, it is important to represent a protein sequence with a comprehensive and proper feature vector with a fixed length [29]. In general, an individual feature extraction strategy can preserve only part of the information about the target, thereby limiting prediction performance. Multiple feature representation methods from different sources can be complementary in capturing valuable information to enhance the discrimination power of a hypothesis. In this study, we employ hybrid features extracted from SSI, PSSM, RSA, and CTD to represent antioxidant proteins.
Secondary Structure Information.
Protein secondary structure determines most protein reactions and reveals the intricate function of protein sequences to a great extent [30,31]. The contents and spatial arrangements of secondary structure elements are significant factors that influence the intricate functions or structures of proteins [32]. The Porter 4.0 server [33] is used in this study to predict three-state secondary structures. It assigns every amino acid in a protein sequence to one of three secondary structure elements, i.e. H (helix), E (strand), and C (coil). The following related features are extracted from the secondary structure elements.
(i) Content information is one of the most widely used secondary structure features [32]. It is defined as $N_j / L$ for $j \in \{H, E, C\}$ (3 features), where $N_j$ is the number of helix/strand/coil elements and $L$ is the length of the protein sequence.
(ii) Transition information of helix/strand/coil along a protein sequence is calculated from $N_{i,j}$ (9 features), where $N_{i,j}$ denotes the number of secondary structure element combinations from secondary structure type $i$ to type $j$ of helix/strand/coil.
(iii) The average length and the normalized maximal length of the segments with each secondary structure type are calculated from the segment lengths (6 features), where $Seg_i$ denotes a segment composed of the secondary structure element helix/strand/coil, $Len(Seg_i)$ is the length of $Seg_i$, and Max is the maximum over the segment lengths.
(iv) Order-related features from secondary structure elements are introduced to reflect the spatial arrangements of the secondary structure elements (3 features), where $N_i$ is the number of helix/strand/coil elements and $p_{i,j}$ denotes the position of the $j$th occurrence of the corresponding secondary structure element.
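The content, transition, and segment-length features in (i) to (iii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ssi_features` is hypothetical, the transition counts are normalized by $L - 1$, and the maximal segment length is normalized by $L$; the source does not confirm these normalizations. The order-related features in (iv) are left out because their exact formula is not recoverable from the text.

```python
from itertools import groupby

STATES = "HEC"  # helix, strand, coil


def ssi_features(ss):
    """Return 18 features from a three-state secondary structure string.

    Assumed normalizations (not confirmed by the paper): transition
    counts are divided by L - 1, maximal segment length by L.
    """
    L = len(ss)
    if L < 2 or any(s not in STATES for s in ss):
        raise ValueError("expected a string over H/E/C of length >= 2")
    feats = {}
    # (i) content: fraction of residues in each state
    for s in STATES:
        feats[f"content_{s}"] = ss.count(s) / L
    # (ii) transition: frequency of each ordered pair of adjacent states
    for a in STATES:
        for b in STATES:
            n_ab = sum(1 for x, y in zip(ss, ss[1:]) if x == a and y == b)
            feats[f"trans_{a}{b}"] = n_ab / (L - 1)
    # (iii) segments: runs of identical states and their lengths
    segments = [(state, len(list(run))) for state, run in groupby(ss)]
    for s in STATES:
        lengths = [n for state, n in segments if state == s]
        feats[f"avg_seg_{s}"] = sum(lengths) / len(lengths) if lengths else 0.0
        feats[f"max_seg_{s}"] = max(lengths) / L if lengths else 0.0
    return feats
```

For a predicted string such as `"CCHHHHEEC"`, the function returns 3 content, 9 transition, and 6 segment features keyed by name.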

Position Specific Scoring Matrix.
With the avalanche of genome sequences generated in the post-genomic age, the completed human genome provides a large number of novel proteins containing conserved domains [34]. The conserved domains serve as evidence for structural and functional conservation [35]. Evolutionary conservation can determine important biological functions and is important in biological sequence analysis [36]. The PSSM (Position Specific Scoring Matrix), which has been widely employed in protein attribute prediction problems [37,38], is adopted here to capture the evolutionary conservation and some essential signatures of protein sequences. The PSSM is a matrix of score values derived from PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) [39] with 3 iterations and an E-value cutoff of 0.0001. For a given protein sequence with L amino acids, the corresponding PSSM has L × 20 elements and is defined as

$$\mathrm{PSSM} = \begin{bmatrix} p_{1,1} & p_{1,2} & \cdots & p_{1,20} \\ p_{2,1} & p_{2,2} & \cdots & p_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ p_{L,1} & p_{L,2} & \cdots & p_{L,20} \end{bmatrix}$$

where the rows and columns of the PSSM are indexed by the protein residues and the 20 native amino acids, respectively. The values in the ith row denote the probabilities of the ith residue of the given protein sequence mutating to the 20 native amino acids during the evolution process.
To formulate protein sequences into the feature vectors with the same dimension, all the rows in the PSSM corresponding to the same amino acids in a protein sequence are summed up. Then the PSSM is transformed into a 20 × 20 dimensional matrix. These 400 elements are extracted from the PSSM to encode protein sequences.
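The row-summing step described above can be illustrated as follows. This is a minimal sketch, not the authors' implementation: the function name `pssm_400` is hypothetical, the PSSM is assumed to be already parsed into a list of L rows of 20 scores, and the columns are assumed to follow the amino acid order used in PSI-BLAST ASCII output.

```python
# Column order of PSI-BLAST ASCII PSSM output (assumed for the input rows).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def pssm_400(sequence, pssm):
    """Collapse an L x 20 PSSM into a 400-dimensional feature vector.

    Rows belonging to the same residue type are summed, giving a
    20 x 20 matrix that is flattened row by row.
    """
    if len(sequence) != len(pssm):
        raise ValueError("sequence and PSSM must have the same length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    summed = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        if len(row) != 20:
            raise ValueError("each PSSM row must have 20 scores")
        r = index[residue]  # KeyError for non-standard residues
        for j, score in enumerate(row):
            summed[r][j] += score
    return [value for row in summed for value in row]
```

Because the output length no longer depends on L, sequences of different lengths map to vectors of the same dimension, which is what the classifiers require.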

Relative Solvent Accessibility.
Solvent accessibility is a key property of amino acid residues and plays an important part in a protein's function [40]. The accessible surface area of a protein is closely related to its overall antioxidant activity. Greater solvent accessibility of amino acid residues is associated with higher antioxidant activity of a protein, because exposed residues can scavenge free radicals and chelate prooxidative metals [13]. Therefore, it is reasonable to extract features from RSA (Relative Solvent Accessibility).
The RSA is defined as the solvent ASA (Accessible Surface Area) of a given residue normalized by the ASA of this residue in an extended tripeptide (Ala-X-Ala) conformation [41]. The RSA values are predicted by PaleAle 4.0 [33]. Using the software, each residue of the query sequence is assigned a buried or exposed state.
The following 28 features are designed to encode each protein sequence. (i) Mean/standard deviation of all residues' RSA scores (2 features). (ii) Number of buried/exposed segments (2 features). (iii) Minimum/maximum length of buried/exposed segments (4 features). (iv) Average RSA score of each native amino acid (20 features).
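The 28 RSA features listed above could be assembled as in the sketch below. The function name `rsa_features`, the input layout (a per-residue RSA score list plus a state string using `B` for buried and `E` for exposed), the use of the population standard deviation, and the value 0 for absent amino acids or absent segment types are all assumptions made for illustration.

```python
from itertools import groupby
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rsa_features(sequence, rsa, states):
    """Return the 28 RSA features as a list.

    sequence: amino acid string; rsa: per-residue RSA scores;
    states: string of 'B' (buried) / 'E' (exposed), one per residue.
    """
    if not sequence or not (len(sequence) == len(rsa) == len(states)):
        raise ValueError("inputs must be non-empty and of equal length")
    # (i) mean and standard deviation of all RSA scores
    feats = [mean(rsa), pstdev(rsa)]
    # segments are maximal runs of the same state
    runs = [(state, len(list(run))) for state, run in groupby(states)]
    buried = [n for state, n in runs if state == "B"]
    exposed = [n for state, n in runs if state == "E"]
    # (ii) number of buried / exposed segments
    feats += [len(buried), len(exposed)]
    # (iii) minimum and maximum segment length for each state
    for lengths in (buried, exposed):
        feats += [min(lengths, default=0), max(lengths, default=0)]
    # (iv) average RSA score of each native amino acid
    for aa in AMINO_ACIDS:
        values = [r for s, r in zip(sequence, rsa) if s == aa]
        feats.append(mean(values) if values else 0.0)
    return feats
```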

Composition, Transition, Distribution.
We analyze amino acid composition of the residues in positive samples and negative samples. As shown in Fig 2, there is a big difference in terms of amino acid compositions between positive samples and negative samples. To further extract information on composition, order, and distribution from protein sequences, a global feature extraction strategy called CTD (Composition, Transition, Distribution), introduced by Dubchak et al. [42], is adopted to encode protein sequences.
Using CTD, three global descriptors, composition (C), transition (T) and distribution (D), are employed in this study to describe the properties of protein sequences. For a given protein sequence, composition (C) describes the global percent composition of the 20 native amino acids (20 features). Transition (T) characterizes the percent frequency with which one type of native amino acid is followed by another type (190 features). Distribution (D) measures the respective locations of the first, 25%, 50%, 75% and 100% of each type of the 20 native amino acids (100 features). For a detailed description of the CTD method, please refer to [42].
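A rough Python sketch of the three descriptors over the 20 native amino acids is given below. The name `ctd_features` is hypothetical, and several details are assumptions rather than the published definition: transitions are counted over unordered pairs of distinct amino acids (which gives the 190 features), values are returned as fractions rather than percentages, and the distribution positions are taken at the occurrence whose rank is the ceiling of the given fraction of all occurrences.

```python
from itertools import combinations
from math import ceil

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def ctd_features(seq):
    """Return composition (20) + transition (190) + distribution (100)."""
    L = len(seq)
    if L < 2 or any(x not in AMINO_ACIDS for x in seq):
        raise ValueError("expected >= 2 standard amino acid letters")
    # Composition: fraction of each amino acid
    comp = [seq.count(a) / L for a in AMINO_ACIDS]
    # Transition: adjacent residues of two different types, either order
    pair_counts = {}
    for x, y in zip(seq, seq[1:]):
        if x != y:
            key = frozenset((x, y))
            pair_counts[key] = pair_counts.get(key, 0) + 1
    trans = [pair_counts.get(frozenset(p), 0) / (L - 1)
             for p in combinations(AMINO_ACIDS, 2)]
    # Distribution: relative position of the first, 25%, 50%, 75%, 100%
    # occurrence of each amino acid (0.0 if the amino acid is absent)
    dist = []
    for a in AMINO_ACIDS:
        positions = [i + 1 for i, x in enumerate(seq) if x == a]
        n = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if n == 0:
                dist.append(0.0)
            else:
                rank = max(1, ceil(frac * n))  # 1-based occurrence rank
                dist.append(positions[rank - 1] / L)
    return comp + trans + dist
```

The returned vector has 20 + 190 + 100 = 310 entries, matching the feature counts given in the text.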

Feature Selection
After carrying out the feature extraction strategies mentioned above, protein sequences are represented by numerical feature vectors with the same dimension. However, the features may not contribute equally to identifying antioxidant proteins, because some of them are redundant or irrelevant. Such features may degrade the performance of a classifier, slow down the learning process, and decrease the generalization power of the learned classifiers [43]. Feature selection is an effective way to overcome these disadvantages: it can improve the classification accuracy of a classifier, simplify the classifier, and thereby aid understanding of the potential physical meaning in the data [44]. In this study, Relief-IFS is adopted to search for the optimal features.
Relief.
The Relief algorithm, originally proposed by Kira [45], is a feature-weighting algorithm that is considered one of the most successful algorithms for depicting the relevance between the features and class labels. It is noise-tolerant and requires only linear time. Based on the ability of a feature to distinguish between near samples, the Relief algorithm can be used to estimate the quality of each feature [46]. A feature with a larger weight is more relevant to the target prediction. The Relief algorithm is executed iteratively. During each iteration, the Relief algorithm updates the weight of each feature as

$$W_p^{i+1} = W_p^{i} - \frac{\mathrm{diff}(p, x_i, H(x_i))}{m} + \frac{\mathrm{diff}(p, x_i, M(x_i))}{m}$$

where $W_p^{i}$ and $W_p^{i+1}$ denote the current and next weights, respectively. $p$ represents a given feature. $x_i$ stands for the ith sample sequence. $H(x_i)$, termed the nearest hit, represents the nearest neighbor samples with the same class label as $x_i$. $M(x_i)$, referred to as the nearest miss, stands for the nearest neighbor samples with a different class label from $x_i$. $Y$ and $S$ denote the sample sets with the same and different class labels relative to $x_i$, respectively. $m$ is the number of random samples.
The function $\mathrm{diff}(p, x, y)$ is used for calculating the distance between the random samples to find the nearest neighbor.
The ranked feature list can be obtained based on the weights and is represented as

$$F = [f_1, f_2, \ldots, f_N]$$

where $f_1$ represents the feature with the highest weight, $f_2$ the one with the second highest, and so on, down to $f_N$ with the lowest.
Incremental Feature Selection.
Based on the ranked feature list evaluated by Relief, IFS (Incremental Feature Selection), one of the well-known search strategies for feature selection, is employed to determine the optimal feature subset. During the IFS procedure, the feature subset starts with the single feature that has the highest Relief weight. Then, features in the ranked feature list are added one by one from higher to lower rank into the feature subset [47]. A new feature subset is generated each time a new feature from the feature list is added. In this study, individual predictors for all feature subsets are constructed using our ensemble classifier and evaluated by 10-fold cross validation on the training dataset. The feature subset that has the highest accuracy is selected as the final input of the optimal ensemble classifier.
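To make the Relief weighting concrete, the following is a minimal two-class sketch in plain Python. It is not the WEKA implementation used in the study. The function names `relief_weights` and `rank_features` are hypothetical, and the sketch assumes numeric features, a single nearest hit and nearest miss per sampled instance, and a `diff` defined as the absolute difference on one feature scaled by that feature's range.

```python
import random


def relief_weights(X, y, m=None, seed=0):
    """Estimate one Relief weight per feature.

    X: list of numeric feature vectors; y: class labels (two classes,
    each with at least two samples); m: number of random draws.
    """
    n, d = len(X), len(X[0])
    if m is None:
        m = n
    rng = random.Random(seed)
    # Feature ranges, so that diff() lies in [0, 1] (assumed definition)
    spans = []
    for p in range(d):
        column = [row[p] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(p, a, b):
        return abs(a[p] - b[p]) / spans[p]

    def distance(a, b):
        return sum(diff(p, a, b) for p in range(d))

    weights = [0.0] * d
    for _ in range(m):
        i = rng.randrange(n)
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for p in range(d):
            # Penalize features that differ on the nearest hit,
            # reward features that differ on the nearest miss.
            weights[p] += (diff(p, X[i], X[miss]) - diff(p, X[i], X[hit])) / m
    return weights


def rank_features(weights):
    """Feature indices sorted from highest to lowest weight."""
    return sorted(range(len(weights)), key=lambda p: weights[p], reverse=True)
```

A feature that separates the classes well ends up with a positive weight, while a feature that varies mostly within a class is pushed toward negative values and lands at the bottom of the ranked list.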
The WEKA (Waikato Environment for Knowledge Analysis) software package [48] is used for the feature selection algorithm Relief, where default parameters are employed. The software package can be downloaded at http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
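The IFS search over the Relief-ranked list can be sketched as a simple loop. The `evaluate` callable is a hypothetical stand-in for the 10-fold cross-validation accuracy of the ensemble classifier on a given feature subset. Breaking ties in favor of the smaller subset is an assumption of this sketch.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Grow the subset one ranked feature at a time and keep the best.

    ranked_features: feature identifiers, most relevant first.
    evaluate: callable mapping a list of features to a score
              (for example, cross-validated accuracy).
    Returns (best_subset, best_score).
    """
    best_subset, best_score = [], float("-inf")
    subset = []
    for feature in ranked_features:
        subset.append(feature)
        score = evaluate(subset)
        if score > best_score:  # strict: ties keep the smaller subset
            best_subset, best_score = list(subset), score
    return best_subset, best_score
```

With N ranked features the loop trains and evaluates N models, so the cost of `evaluate` dominates the running time.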

Ensemble Learning Method
Every single classifier usually has its own inherent defects and cannot always perform well on all datasets [49]. Generally, a well-defined ensemble classifier is able to address statistical, computational, and representational issues better than its component individual classifiers [50], because an ensemble classifier can make use of the different decision boundaries generated by the individual classifiers to strategically combine the classification results [51,52]. The prediction performance of an ensemble classifier is affected by the diversity and individual accuracy of its component base classifiers [53,54].
The 10 base classifiers are trained and ranked according to accuracy. A ranked classifier list is obtained and represented as

$$C = [C_1, C_2, \ldots, C_{10}]$$

where $C_1$ represents the classifier with the highest accuracy, $C_2$ the one with the second highest, and so on, down to $C_{10}$ with the lowest. Based on the ranked classifier list, the idea of IFS is employed to determine the optimal classifier subset. The classifier subset starts with the single classifier that has the highest accuracy. Then, classifiers in the ranked classifier list are added one by one from higher to lower rank into the classifier subset. A new classifier subset is generated each time a new classifier is added. We evaluate the prediction performance of each classifier subset. The classifier subset with the highest accuracy is selected to construct the ensemble predictor. The prediction results of each base classifier in the selected classifier subset are combined using average probability. Fig 3 shows the diagram of the classifier selection method.
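The classifier selection strategy and the average-probability combination can be illustrated with the sketch below. It assumes that each base classifier has already produced, for every sample, a probability of the positive (antioxidant) class, and that a sample is called positive when the averaged probability is at least 0.5. The function names and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
def average_probability(prob_lists):
    """Average the positive-class probabilities of several classifiers."""
    n_samples = len(prob_lists[0])
    return [sum(p[i] for p in prob_lists) / len(prob_lists)
            for i in range(n_samples)]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


def select_classifiers(probs_by_name, labels):
    """Rank classifiers by accuracy, then grow the ensemble IFS-style.

    probs_by_name: dict mapping classifier name -> list of probabilities.
    Returns (names of the best subset, its accuracy).
    """
    ranked = sorted(probs_by_name,
                    key=lambda name: accuracy(probs_by_name[name], labels),
                    reverse=True)
    best, best_acc = [], -1.0
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        combined = average_probability([probs_by_name[n] for n in subset])
        acc = accuracy(combined, labels)
        if acc > best_acc:
            best, best_acc = subset, acc
    return best, best_acc
```

In practice the probabilities would come from cross-validated predictions on the training dataset, so that the subset is not chosen on data the classifiers were fitted to.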

Performance Measures
In statistical prediction, three cross-validation methods are often used to examine the accuracy, i.e. the independent dataset test, the sub-sampling test (e.g. 5-fold or 10-fold cross validation), and the jackknife test [55]. Among these three methods, the jackknife test is deemed the most objective and rigorous one, since it can exclude the memory effects during the entire testing process and can always yield a unique result for a given benchmark dataset, as elucidated in [56] and demonstrated by Eq 50 of Chou and Shen [57]. Therefore, the jackknife test has been increasingly and widely adopted by investigators to test the power of various predictors [58,59]. To reduce the computational complexity, the 10-fold cross validation test is employed in this paper. During the procedure, the training dataset is randomly separated into 10 equally sized parts. Each time, 9 parts are merged as the training dataset to train a model, and the remaining part is used to test the model. This process is repeated ten times so that each part is tested once. The ultimate result is the average of the 10 prediction results. To assess the performance of the predictor intuitively, four commonly used indexes are employed.
Fig 3. Diagram of the classifier selection method. doi:10.1371/journal.pone.0163274.g003
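The 10-fold partitioning described above can be sketched as a small index generator. The function name `k_fold_indices` and the fixed random seed are illustrative assumptions. The study itself relies on WEKA for cross validation.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Samples are shuffled once, dealt into k nearly equal parts, and
    each part serves as the test set exactly once.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

For the 200-sample training dataset this gives ten splits of 180 training and 20 test samples, and the reported score is the mean over the ten test parts.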
Sensitivity (Sn) is the percentage of correctly identified antioxidant proteins and is given by

$$Sn = \frac{TP}{TP + FN}$$

Specificity (Sp) is the percentage of correctly identified non-antioxidant proteins and is defined as

$$Sp = \frac{TN}{TN + FP}$$

Accuracy (Acc) is the percentage of correctly identified antioxidant proteins and non-antioxidant proteins and is expressed as

$$Acc = \frac{TP + TN}{TP + FP + TN + FN}$$

MCC (Matthews Correlation Coefficient) is a more stringent measure of prediction accuracy that accounts for both under- and over-predictions [60], and is given by

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, FP, TN, and FN represent true positive, false positive, true negative, and false negative, respectively. To further evaluate the performance of a predictor, the ROC (Receiver Operating Characteristic) curve is also employed [61]. The ROC curve is plotted with Sn as the y-axis and 1 − Sp as the x-axis by varying the thresholds. The AUC (Area Under the ROC Curve), obtained from the ROC curve, is a valid measure for model evaluation. A higher AUC value corresponds to better performance of a predictor.
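The four indexes follow directly from the confusion-matrix counts, as the short sketch below shows. The function name `metrics` is hypothetical. The example counts in the usage note are not reported in the paper: they are back-calculated here from the stated sensitivity and specificity and the dataset sizes, so they should be read as an assumption.

```python
from math import sqrt


def metrics(tp, fp, tn, fn):
    """Return Sn, Sp, Acc and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a marginal is zero; 0.0 is used by convention
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For instance, on a training dataset of 100 positive and 100 negative samples, the assumed counts TP = 95, FN = 5, TN = 93, FP = 7 correspond to Sn = 0.95, Sp = 0.93, Acc = 0.94, and an MCC that rounds to 0.880. On an independent set of 74 positive and 392 negative samples, the assumed counts TP = 65, FN = 9, TN = 337, FP = 55 give values that round to Sn = 0.878, Sp = 0.860, Acc = 0.863, and MCC = 0.617. The lower MCC on the second set, despite similar Sn and Sp, reflects the strong class imbalance.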

Performance Comparisons of Various Individual Classifiers
We select 10 base classifiers and test prediction performance of these individual classifiers. Table 1 shows performance comparisons of these individual classifiers on the training dataset by 10-fold cross validation. From Table 1, the accuracy values of various classifiers are in the range of 0.755 to 0.895, much better than random guess (i.e., an accuracy of 0.500), indicating acceptable performance in antioxidant protein prediction. Among the various individual classifiers, RF achieves the best performance with an accuracy of 0.895, an MCC of 0.790, and an AUC of 0.957, followed by SMO, NNA, J48, BN, RBFNetwork, DT, Adaboost, VFI, and NB. In addition, RF obtains balanced performance with an Sn of 0.9 and an Sp of 0.89. These results demonstrate that RF is relatively effective in antioxidant protein identification.

Performance Comparisons of Ensemble Classifiers
To obtain the optimal ensemble classifier for antioxidant protein identification, we evaluate the accuracy of multiple individual classifiers and obtain a classifier list ranked by accuracy. In the classifier list, a classifier with a smaller index is more important for antioxidant protein identification. The classifier list is used to select the optimal classifier subset following the idea of the IFS procedure. The ranked classifiers are added one by one from the top of the classifier list to the bottom, and a predictor is built for each classifier subset and evaluated on the training dataset by 10-fold cross validation. Prediction performance of the classifier subsets is shown in Table 2. From Table 2 and Fig 4, as the number of classifiers increases, accuracy shows an upward trend in the initial phase. Afterwards, accuracy shows a downward trend as the number of classifiers increases further. The best accuracy reaches 0.925 when 4 classifiers are selected, namely RF, SMO, NNA, and J48. These 4 classifiers are used to construct the optimal ensemble classifier for predicting antioxidant proteins. The default parameters of these four base classification algorithms in WEKA are used in this paper. This ensemble classifier also achieves the best sensitivity of 0.94, specificity of 0.91, and MCC of 0.850. These results indicate that the ensemble classifier is effective in predicting antioxidant proteins. We should also note that combining more base classifiers does not always achieve better performance, because these base classifiers may share more or less similar learning strategies.

Performance Comparisons of Ensemble Learning Method and Individual Base Classifiers
To verify the strength of the proposed ensemble method, prediction results of our ensemble method and its component base classifiers, including RF, SMO, NNA, and J48, are compared.

Feature Selection Results
The hybrid features are ranked based on the Relief method. Within the feature list (see S2 Table), a feature with a smaller index is more important for antioxidant protein prediction. Then, the IFS method combined with our ensemble classifier is employed to search for the optimal features. In the IFS procedure, the ranked features are added one by one, and individual predictors for all the feature subsets are constructed using our ensemble classifier and evaluated [...] Table. The IFS curve is plotted in Fig 5, which shows the relationship between feature indices and accuracy. From Fig 5, the curve reaches its peak with an accuracy of 0.94 when the first 152 features in the S2 Table are selected. These features are regarded as the optimal features for antioxidant protein prediction.

Contribution of Feature Selection to Our Ensemble Classifier
To investigate the influence of feature selection on the performance of the ensemble classifier, the prediction results of the ensemble method with and without feature selection are shown in Table 3. Fig 6 depicts the ROC curves obtained with and without feature selection. From Table 3 and Fig 6, the ensemble method with feature selection achieves a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, an MCC of 0.880, and an AUC of 0.978, which are all superior to those of the ensemble method without feature selection. These results demonstrate that many redundant or uninformative features are present in the original feature sets and that the Relief-IFS method can effectively remove these features to improve the performance of the ensemble model. The ensemble classifier with feature selection is adopted as the final predictor for antioxidant protein prediction.

Analysis of the Optimal Features
The feature type distributions of the original features and the optimal features are investigated and shown in Fig 7. From Fig 7, among the 152 optimal features, there are 16 SSI features, 9 RSA features, 92 CTD features, and 35 PSSM features, indicating that all kinds of features contribute to the prediction of antioxidant proteins.
To evaluate which feature types make more contributions to prediction performance of antioxidant proteins, the percentages of the optimal features accounting for the corresponding [...] (PSSM), indicating that SSI features play a crucial role in predicting antioxidant proteins. Protein secondary structure reveals the intricate function of protein sequences to a great extent [30,31]. This is the first attempt to employ SSI based features for antioxidant protein prediction, which may help provide new annotations for the properties of antioxidant proteins. 32.14% of RSA features are selected as the optimal features, indicating that RSA based features play an irreplaceable role in predicting antioxidant proteins. Solvent accessibility plays an important part in a protein's function [40]. The accessible surface area of a protein is closely related to its overall antioxidant activity. Greater solvent accessibility of amino acid residues is associated with higher antioxidant activity of a protein, because exposed residues can scavenge free radicals and chelate prooxidative metals [13]. We analyze the amino acid composition of the residues in positive samples and negative samples. There is a marked difference in amino acid composition between positive samples and negative samples. CTD based features account for a reasonable proportion of the optimal feature set. This implies that information on composition, transition and distribution plays some role in predicting antioxidant proteins. It is noted that the ratio of PSSM based features is smaller than that of the other feature types, because this feature type is the most numerous in the original feature set. Evolutionary conservation can determine important biological functions [36]. PSSM based features are also necessary in predicting antioxidant proteins.
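The share of each feature type that survives selection can be recomputed from the counts given in the text, as in the sketch below. The totals per feature type are summed here from the feature counts stated in the Feature Extraction section (this summation is ours, not a figure quoted from the paper), and the function name `selection_percentages` is hypothetical.

```python
def selection_percentages():
    """Percentage of each original feature type kept in the optimal set."""
    # Totals derived from the per-descriptor counts in Feature Extraction
    totals = {
        "SSI": 3 + 9 + 6 + 3,     # content, transition, segment, order
        "PSSM": 400,              # 20 x 20 summed matrix
        "RSA": 2 + 2 + 4 + 20,    # the 28 RSA features
        "CTD": 20 + 190 + 100,    # composition, transition, distribution
    }
    # Counts among the 152 optimal features, as reported in the text
    selected = {"SSI": 16, "PSSM": 35, "RSA": 9, "CTD": 92}
    return {name: 100.0 * selected[name] / totals[name] for name in totals}
```

Under these counts the RSA share comes out at 32.14%, in agreement with the value quoted in the text, while SSI is about 76.19%, CTD about 29.68%, and PSSM 8.75%.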
The proposed predictor is designed by analyzing the sequence characteristics and other characteristics related to the antioxidant functions of antioxidant proteins in order to distinguish between antioxidant proteins and non-antioxidant proteins, which can provide theoretical guidance for experiments on antioxidant proteins. However, this predictor cannot be used to design antioxidant proteins that act through multiple pathways, including inactivation of reactive oxygen species, scavenging of free radicals, chelation of prooxidative transition metals, reduction of hydroperoxides, and alteration of the physical properties.

Performance Comparisons with the Existing Methods
To evaluate the prediction performance objectively, we compare our method with reference [22] on the same training dataset. Table 4 reports the detailed prediction results obtained by our ensemble classifier and [22] using 10-fold cross validation. From Table 4, our ensemble classifier obtains satisfactory performance and outperforms the method in reference [22]. The sensitivity, specificity, accuracy, MCC, and AUC obtained by the proposed method are about 4%, 4%, 4%, 8% and 3.8% higher than those achieved by the method in reference [22], respectively.
To further assess the prediction performance of the proposed method, we make comparisons with [21,22] on the same independent testing dataset. The performance comparison based on the same dataset is much more reliable, which can reflect the performance of a predictor more objectively. As listed in Table 5, the prediction results of our ensemble classifier are significantly better than those of the method in reference [21]. Although the sensitivity yielded by the method in reference [22] is a little higher than that obtained by our predictor, the specificity, accuracy, and MCC of our method are significantly higher than those achieved by the method in reference [22], which indicates that an unreasonable balance between sensitivity and specificity exists in the method in reference [22]. Our method achieves a balanced performance with a sensitivity of 0.878 and a specificity of 0.860, which is also reflected by an MCC of 0.617. It also gives a satisfactory discrimination power expressed by an accuracy of 0.863 and an AUC of 0.948. From Tables 4 and 5, the predictions deteriorate significantly in the independent testing dataset. This phenomenon may be due to the fact that the independent test set is not used in the learning process of our proposed predictor. The parameters of our proposed predictor are determined in the learning process based on the training dataset. Therefore, the proposed ensemble classifier has fairly good performance in predicting antioxidant proteins, superior to previous methods, which may be conducive to better understanding physiological processes of certain types of diseases and developing novel antioxidation-based drugs.

Online Web Server
Since user-friendly and publicly accessible web servers represent the future direction for developing more practical predictors, we have established a free web server at http://antioxidant.weka.cc for the method presented in this paper. Users can enter query protein sequences in FASTA format or input the UniProtKB ID of the query protein sequences in the text box area for prediction. When protein sequences are submitted to the server, a job ID is presented to users. The result page returns the input information and the predicted result.

Conclusions
In this study, we have proposed an ensemble predictor using hybrid features extracted from SSI, PSSM, RSA, and CTD to predict antioxidant proteins. We investigate the prediction capabilities of various base classifiers and obtain a classifier list ranked by accuracy. Based on the ranked classifier list, the idea of IFS is employed to determine an optimal classifier subset. Compared with its component base classifiers, the optimal ensemble classifier achieves much better prediction performance. To improve the prediction capability of the model and save computational time, the Relief-IFS method is adopted to obtain the optimal features. The ensemble method with feature selection is adopted as the final predictor for antioxidant protein prediction, which achieves a sensitivity of 0.95, a specificity of 0.93, an accuracy of 0.94, an MCC of 0.880, and an AUC of 0.978. To evaluate the prediction performance objectively, the proposed method is compared with existing methods on the same independent testing dataset. Our method obtains the best specificity of 0.860, accuracy of 0.863, MCC of 0.617, and AUC of 0.948. In addition, our method achieves more balanced performance with a sensitivity of 0.878 and a specificity of 0.860. We are convinced that the proposed ensemble predictor is quite promising in predicting antioxidant proteins.
Supporting Information
S1 Table. The benchmark dataset. The benchmark dataset contains a training dataset and an independent testing dataset. The training dataset is composed of 100 antioxidant and 100 non-antioxidant proteins. The independent testing dataset consists of 74 antioxidant and 392 non-antioxidant proteins. (XLSX)
S2 Table. The ranked feature list given by the Relief algorithm. Within the list, a feature with a smaller index is more important for antioxidant protein prediction. This list of ranked features is used to establish the optimal feature set in the IFS procedure. (XLSX)
S3