iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition

Nitrotyrosine is one of the post-translational modifications (PTMs) in proteins that occurs when a tyrosine residue is nitrated. Compared with healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from rheumatoid arthritis, septic shock, and coeliac disease. Given an uncharacterized protein sequence that contains many tyrosine residues, which of them can be nitrated and which cannot? This is a challenging problem, directly related not only to an in-depth understanding of the PTM’s mechanism but also to nitrotyrosine-based drug development. Particularly, with the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop a high-throughput tool in this regard. Here, a new predictor called “iNitro-Tyr” was developed by incorporating the position-specific dipeptide propensity into the general pseudo amino acid composition for discriminating nitrotyrosine sites from non-nitrotyrosine sites in proteins. It was demonstrated via rigorous jackknife tests that the new predictor not only yields a higher success rate but is also much more stable and less noisy. A web-server for iNitro-Tyr is accessible to the public at http://app.aporc.org/iNitro-Tyr/. For the convenience of most experimental scientists, we have further provided a step-by-step guide, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of its development process. It has not escaped our notice that the approach presented here can also be used to deal with other PTM sites in proteins.


Introduction
As one of the post-translational modifications (PTMs) of proteins, nitrotyrosine is a product of tyrosine nitration mediated by reactive nitrogen species such as peroxynitrite anion and nitrogen dioxide (Fig. 1). Compared with the fluids from healthy people, a remarkably increased level of nitrotyrosine is detected in those suffering from rheumatoid arthritis, septic shock, and coeliac disease. Accordingly, knowledge of nitrotyrosine sites in proteins is very useful for both basic research and drug development. Although conventional experimental methods did provide useful insight into the biological roles of tyrosine nitration [1][2][3], it is time-consuming and expensive to determine the nitrotyrosine sites based on the experimental approach alone. Particularly, identification of endogenous 3-NTyr modifications remains largely elusive (see, e.g., [4][5][6][7]). With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop computational methods for identifying the nitrotyrosine sites in proteins. The present study was initiated in an attempt to propose a new method for identifying the nitrotyrosine sites in proteins in hope that it can play a complementary role with the existing methods in this area.
As summarized in [8] and demonstrated in a series of recent publications [9][10][11][12][13][14][15][16][17][18][19][20][21], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the biological samples with an effective mathematical expression that can truly capture their essence and intrinsic correlation with the target to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy; (v) establish a user-friendly web-server that is accessible to the public. Below, let us describe how to deal with these steps one by one.

Benchmark Dataset
To develop a statistical predictor, it is fundamentally important to establish a reliable and stringent benchmark dataset to train and test the predictor. If the benchmark dataset contains some errors, the predictor trained by it must be unreliable and the accuracy tested by it would be completely meaningless.
To facilitate the description below, let us adopt Chou's peptide formulation, which was used for studying HIV protease cleavage sites [22,23], the specificity of GalNAc-transferase [24], and signal peptide cleavage sites [25]. According to Chou's scheme, a potential nitrotyrosine peptide, i.e., a peptide with Tyr (namely Y) located at its center (Fig. 2), can be expressed as

$$P_{\xi}(\mathrm{Y}) = R_{-\xi} R_{-(\xi-1)} \cdots R_{-2} R_{-1}\, \mathrm{Y}\, R_{1} R_{2} \cdots R_{\xi-1} R_{\xi} \tag{1}$$

where the subscript $\xi$ is an integer, $R_{-\xi}$ represents the $\xi$-th upstream amino acid residue from the center, $R_{\xi}$ the $\xi$-th downstream amino acid residue, and so forth. A $(2\xi+1)$-tuple peptide $P_{\xi}(\mathrm{Y})$ can be further classified into the following categories:

$$P_{\xi}(\mathrm{Y}) \in \begin{cases} P_{\xi}^{+}(\mathrm{Y}), & \text{if its center is a nitrotyrosine site} \\ P_{\xi}^{-}(\mathrm{Y}), & \text{otherwise} \end{cases} \tag{2}$$

where $P_{\xi}^{+}(\mathrm{Y})$ represents a true nitrotyrosine peptide, $P_{\xi}^{-}(\mathrm{Y})$ a false nitrotyrosine peptide, and $\in$ represents "a member of" in set theory.
As pointed out in a comprehensive review [26], there is no need to separate a benchmark dataset into a training dataset and a testing dataset for examining the performance of a prediction method if it is tested by the jackknife test or subsampling (K-fold) cross-validation test. Thus, the benchmark dataset for the current study can be formulated as

$$S_{\xi} = S_{\xi}^{+} \cup S_{\xi}^{-} \tag{3}$$

where $S_{\xi}^{+}$ only contains the samples of $P_{\xi}^{+}(\mathrm{Y})$, i.e., the nitrotyrosine peptides; $S_{\xi}^{-}$ only contains the samples of $P_{\xi}^{-}(\mathrm{Y})$, i.e., the non-nitrotyrosine peptides (cf. Eq. 2); and $\cup$ represents the symbol for "union" in set theory.
Since the length of the peptide $P_{\xi}(\mathrm{Y})$ is $2\xi+1$ (Eq. 1), the benchmark dataset with different values of $\xi$ will contain peptides of different numbers of amino acid residues, as formulated by

$$S_{\xi} \text{ contains the peptides of } \begin{cases} 13 \text{ residues}, & \text{when } \xi = 6 \\ 15 \text{ residues}, & \text{when } \xi = 7 \\ 17 \text{ residues}, & \text{when } \xi = 8 \\ 19 \text{ residues}, & \text{when } \xi = 9 \\ 21 \text{ residues}, & \text{when } \xi = 10 \end{cases} \tag{4}$$

The detailed procedures to construct $S_{\xi}$ are as follows. (i) Its elements were derived from the same 546 source proteins used in [27], which contain 1,044 nitrotyrosine sites (see columns 1 and 2 of Supporting Information S1). (ii) A flexible window of $2\xi+1$ amino acids (Fig. 3) was slid along each of the 546 protein sequences taken from the UniProt database (version 2014_01). (iii) Only those peptide segments with Y (tyrosine) at the center were collected. (iv) If the number of upstream or downstream residues in a protein was less than $\xi$, the lacking residues were filled with the dummy residue "X" [28]. (v) The peptide samples thus obtained were put into the positive subset $S_{\xi}^{+}$ if their centers have been experimentally confirmed as nitrotyrosine sites; otherwise, into the negative subset $S_{\xi}^{-}$. By following the aforementioned procedures, five such benchmark datasets ($S_{\xi=6}$, $S_{\xi=7}$, $S_{\xi=8}$, $S_{\xi=9}$, and $S_{\xi=10}$) were constructed. Each of these datasets contained 1,044 nitrotyrosine peptides and 7,669 non-nitrotyrosine peptides. Note that the sample numbers thus obtained differ slightly from those in [27]. This is because some proteins originally used in [27] have been removed or replaced in the updated version of the UniProt database.
However, it was observed via preliminary trials that when $\xi = 9$, i.e., when the peptide samples concerned were formed by 19 residues, the corresponding results were most promising (see Fig. 4 and Fig. 5). Accordingly, we chose $S_{\xi=9}$ as the benchmark dataset for further investigation. Thus, Eq. 3 can be reduced to

$$S = S^{+} \cup S^{-} \tag{5}$$

where $S = S_{9}$, $S^{+} = S_{9}^{+}$ containing 1,044 nitrotyrosine peptide samples, and $S^{-} = S_{9}^{-}$ containing 7,669 non-nitrotyrosine peptide samples. The detailed 19-tuple peptide sequences and their positions in proteins are given in Supporting Information S1.
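The window-sliding procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function names, the `xi` parameter name for the half-window size, and the toy sequence are all hypothetical, and the set of experimentally confirmed positions is assumed to be supplied by the user.

```python
from typing import List, Set, Tuple


def extract_tyr_peptides(sequence: str, xi: int = 9, pad: str = "X") -> List[Tuple[int, str]]:
    """Return (1-based position, peptide) for every Y in `sequence`.

    Each peptide has 2*xi + 1 residues with Y at its center; missing
    upstream/downstream residues are filled with the dummy residue `pad`.
    """
    padded = pad * xi + sequence + pad * xi
    peptides = []
    for pos, residue in enumerate(sequence):
        if residue == "Y":
            # Residue `pos` sits at index pos + xi in `padded`, so the window
            # with xi residues on each side is padded[pos : pos + 2*xi + 1].
            peptides.append((pos + 1, padded[pos:pos + 2 * xi + 1]))
    return peptides


def split_by_label(peptides: List[Tuple[int, str]],
                   nitrated_positions: Set[int]) -> Tuple[List[str], List[str]]:
    """Split peptides into (positive, negative) subsets by confirmed site positions."""
    positive = [pep for pos, pep in peptides if pos in nitrated_positions]
    negative = [pep for pos, pep in peptides if pos not in nitrated_positions]
    return positive, negative


# Toy example (hypothetical sequence, half-window of 2 for readability).
TOY_PEPTIDES = extract_tyr_peptides("MKYAY", xi=2)
TOY_POSITIVE, TOY_NEGATIVE = split_by_label(TOY_PEPTIDES, {3})
```

With the half-window set to 9, the same function would yield the 19-residue peptides used in the benchmark dataset.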

Feature Vector and Pseudo Amino Acid Composition
One of the most important but also most difficult problems in computational biology today is how to effectively formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence order information. This is because all the existing operation engines, such as correlation angle approach [29], covariance discriminant [30], neural network [31], support vector machine (SVM) [32], random forest [33], conditional random field [28], K-nearest neighbor (KNN) [34], OET-KNN [35], Fuzzy K-nearest neighbor [36], ML-KNN algorithm [37], and SLLE algorithm [30], can only handle vectors but not sequence samples. However, a vector defined in a discrete model may totally miss the sequence-order information. To deal with such a dilemma, the approach of pseudo amino acid composition [38] or Chou's PseAAC [39] was proposed. Ever since it was introduced in 2001 [38], the concept of PseAAC has rapidly penetrated into almost all the areas of computational proteomics, such as identifying bacterial virulent proteins [40], predicting anticancer peptides [41], predicting protein subcellular location [42], predicting membrane protein types [43], analyzing genetic sequence [44], predicting GABA(A) receptor proteins [45], identifying antibacterial peptides [46], identifying allergenic proteins [47], predicting metalloproteinase family [48], identifying GPCRs and their types [49], and identifying protein quaternary structural attributes [50], among many others (see a long list of references cited in a 2014 article [51]). Recently, the concept of PseAAC was further extended to represent the feature vectors of DNA and nucleotides [9], as well as other biological samples (see, e.g., [52]).
Because it has been widely and increasingly used, three powerful open-access software packages, called 'PseAAC-Builder' [53], 'propy' [54], and 'PseAAC-General' [51], were recently established: the former two are for generating various modes of Chou's special PseAAC, while the third is for those of Chou's general PseAAC.
According to a comprehensive review [8], PseAAC can be generally formulated as

$$P = [\psi_{1}\ \psi_{2}\ \cdots\ \psi_{u}\ \cdots\ \psi_{\Omega}]^{\mathbf{T}} \tag{6}$$

where $\mathbf{T}$ is the transpose operator, while $\Omega$ is an integer reflecting the vector's dimension. The value of $\Omega$ as well as the components $\psi_{u}$ $(u = 1, 2, \cdots, \Omega)$ in Eq. 6 will depend on how to extract the desired information from a protein/peptide sequence. Below, let us describe how to extract the useful information from the benchmark datasets to define the peptide samples via Eq. 6.
For convenience in formulation, let us rewrite Eq. 1 as follows

$$P = R_{1} R_{2} \cdots R_{\xi} R_{\xi+1} R_{\xi+2} \cdots R_{2\xi} R_{2\xi+1} \tag{7}$$

where $R_{\xi+1}$, the residue at the center of the peptide, is tyrosine (Y), and all the other residues $R_{i}$ $(i \neq \xi+1)$ can be any of the 20 native amino acids or the dummy code X as defined above.
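Hereafter, let us use the numerical codes 1, 2, 3, …, 20 to represent the 20 native amino acids according to the alphabetic order of their single-letter codes, and use 21 to represent the dummy amino acid X. Accordingly, the number of possible different dipeptides will be $21 \times 21 = 441$, and the number of dipeptide subsite positions on the sequence of Eq. 7 will be $(2\xi+1-1) = 2\xi$. Now, let us introduce a positive and a negative PSDP (position-specific dipeptide propensity) matrix, as given below

$$Z^{+} = \begin{bmatrix} z^{+}_{1,1} & z^{+}_{1,2} & \cdots & z^{+}_{1,2\xi} \\ z^{+}_{2,1} & z^{+}_{2,2} & \cdots & z^{+}_{2,2\xi} \\ \vdots & \vdots & \ddots & \vdots \\ z^{+}_{441,1} & z^{+}_{441,2} & \cdots & z^{+}_{441,2\xi} \end{bmatrix} \tag{8a}$$

$$Z^{-} = \begin{bmatrix} z^{-}_{1,1} & z^{-}_{1,2} & \cdots & z^{-}_{1,2\xi} \\ z^{-}_{2,1} & z^{-}_{2,2} & \cdots & z^{-}_{2,2\xi} \\ \vdots & \vdots & \ddots & \vdots \\ z^{-}_{441,1} & z^{-}_{441,2} & \cdots & z^{-}_{441,2\xi} \end{bmatrix} \tag{8b}$$

where the element

$$z^{+}_{i,j} = F^{+}(D_{i} \mid j) \quad (i = 1, 2, \cdots, 441;\ j = 1, 2, \cdots, 2\xi) \tag{9}$$

and

$$z^{-}_{i,j} = F^{-}(D_{i} \mid j) \quad (i = 1, 2, \cdots, 441;\ j = 1, 2, \cdots, 2\xi) \tag{10}$$

In Eq. 9, $F^{+}(D_{i} \mid j)$ is the occurrence frequency of the $i$-th dipeptide $(i = 1, 2, \cdots, 441)$ at the $j$-th subsite on the sequence of Eq. 7 (or the $j$-th column in the positive subset dataset $S^{+}$), which can be easily derived using the method described in [55] from the sequences in Supporting Information S1; while $F^{-}(D_{i} \mid j)$ is the corresponding occurrence frequency derived from the negative subset dataset $S^{-}$. Thus, for the peptide sequence of Eq. 7, its attribute to the positive set $S^{+}$ or negative set $S^{-}$ can be formulated by a $2\xi$-D (dimension) vector $P^{+}_{\xi}$ or $P^{-}_{\xi}$, as defined by [23]

$$P^{+}_{\xi} = [\psi^{+}_{1}\ \psi^{+}_{2}\ \cdots\ \psi^{+}_{u}\ \cdots\ \psi^{+}_{2\xi}]^{\mathbf{T}} \tag{11a}$$

$$P^{-}_{\xi} = [\psi^{-}_{1}\ \psi^{-}_{2}\ \cdots\ \psi^{-}_{u}\ \cdots\ \psi^{-}_{2\xi}]^{\mathbf{T}} \tag{11b}$$

where

$$\psi^{+}_{u} = z^{+}_{i,u} \quad \text{when } R_{u}R_{u+1} = D_{i} \quad (u = 1, 2, \cdots, 2\xi) \tag{12a}$$

$$\psi^{-}_{u} = z^{-}_{i,u} \quad \text{when } R_{u}R_{u+1} = D_{i} \quad (u = 1, 2, \cdots, 2\xi) \tag{12b}$$

where $R_{u}$ and $R_{u+1}$ represent the residues in the $u$-th and $(u+1)$-th positions of the peptide concerned.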
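As a rough illustration of how a PSDP matrix and the corresponding feature vector could be computed from a set of equal-length peptides, consider the Python sketch below. It rests on assumptions that are not stated in the text: the occurrence frequency is taken to be the count of a dipeptide at a subsite divided by the number of peptides in the subset, the 441 dipeptides are indexed in the order AA, AC, …, XX, and all function and variable names are hypothetical.

```python
from itertools import product
from typing import List

# 20 native amino acids in alphabetic order of their single-letter codes, then the dummy X.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
DIPEPTIDES = ["".join(pair) for pair in product(ALPHABET, repeat=2)]  # 21 x 21 = 441
DIPEPTIDE_INDEX = {dipeptide: i for i, dipeptide in enumerate(DIPEPTIDES)}


def psdp_matrix(peptides: List[str]) -> List[List[float]]:
    """Return a 441 x (L-1) matrix of dipeptide frequencies per subsite.

    Assumption: frequency = occurrences at the subsite / number of peptides.
    A residue outside ALPHABET raises KeyError.
    """
    n_subsites = len(peptides[0]) - 1  # equals 2*xi for peptides of 2*xi + 1 residues
    counts = [[0.0] * n_subsites for _ in DIPEPTIDES]
    for peptide in peptides:
        for j in range(n_subsites):
            counts[DIPEPTIDE_INDEX[peptide[j:j + 2]]][j] += 1.0
    total = len(peptides)
    return [[value / total for value in row] for row in counts]


def feature_vector(peptide: str, matrix: List[List[float]]) -> List[float]:
    """Look up, for each subsite u, the matrix element of the dipeptide found there."""
    return [matrix[DIPEPTIDE_INDEX[peptide[u:u + 2]]][u] for u in range(len(peptide) - 1)]


# Toy subset of 3-residue peptides (half-window of 1), purely for illustration.
TOY_MATRIX = psdp_matrix(["AYC", "AYD"])
TOY_VECTOR = feature_vector("AYC", TOY_MATRIX)
```

Building one matrix from the positive subset and one from the negative subset, and applying `feature_vector` with each, would give the two vectors described above for a query peptide.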

Discriminant Function Approach
Now in the $2\xi$-D space, let us define an ideal nitrotyrosine peptide $\Pi^{+}$ [22] and an ideal non-nitrotyrosine peptide $\Pi^{-}$ as expressed by

$$\Pi^{+} = [\lambda^{+}_{1}\ \lambda^{+}_{2}\ \cdots\ \lambda^{+}_{2\xi}]^{\mathbf{T}}, \qquad \Pi^{-} = [\lambda^{-}_{1}\ \lambda^{-}_{2}\ \cdots\ \lambda^{-}_{2\xi}]^{\mathbf{T}} \tag{13}$$

where $\lambda^{+}_{i}$ $(i = 1, 2, \cdots, 2\xi)$ is the upper limit of the corresponding matrix element in Eq. 12a, and $\lambda^{-}_{i}$ $(i = 1, 2, \cdots, 2\xi)$ is the upper limit of the corresponding matrix element in Eq. 12b. Theoretically speaking, each of these hypothetical upper limits in Eq. 13 should be 1 [23]. Thus, the similarity score of $P^{+}_{\xi}$ with $\Pi^{+}$ and that of $P^{-}_{\xi}$ with $\Pi^{-}$ can be defined (Eq. 14). Similar to the treatment in [23], let us define a discriminant function $\Delta_{\xi}$ (Eq. 15), which contains an adjust parameter used to optimize the overall success rate when the positive and negative benchmark datasets are highly imbalanced in size. Now the peptide $P_{\xi}$ of Eq. 7 can be identified according to the following rule

$$\begin{cases} P_{\xi} \text{ belongs to nitrotyrosine peptide}, & \text{if } \Delta_{\xi} > 0 \\ P_{\xi} \text{ belongs to non-nitrotyrosine peptide}, & \text{if } \Delta_{\xi} \le 0 \end{cases} \tag{16}$$

The predictor obtained via the above procedures is called iNitro-Tyr.

How to properly and objectively evaluate the anticipated accuracy of a new predictor and how to make it easily accessible and user-friendly are the two key issues that will have important impacts on its application value [56]. Below, let us address these problems.

Figure 3. A flexible window of $2\xi+1$ amino acids sliding along a protein sequence. During the sliding process, the scales on the window are aligned with different amino acids so as to define different peptide segments. When, and only when, the scale 0 is aligned with Y (tyrosine) is the $(2\xi+1)$-tuple peptide segment seen within the window regarded as a potential nitrotyrosine peptide. Adapted from Chou [55,77].
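The decision rule of Eq. 16 can be illustrated with a short Python sketch. Note that the forms of the similarity score and of the discriminant function used here are assumptions made only for illustration: the similarity is taken as the dot product with the ideal vector (all components equal to 1) divided by the squared length of that ideal vector, and the adjust parameter is assumed to enter as a subtracted offset. The paper's own Eqs. 14 and 15 may differ, and all names below are hypothetical.

```python
from typing import Sequence


def similarity(vector: Sequence[float], ideal: Sequence[float]) -> float:
    """ASSUMED similarity: dot product with the ideal vector, normalized by |ideal|^2."""
    dot = sum(v * w for v, w in zip(vector, ideal))
    norm_sq = sum(w * w for w in ideal)
    return dot / norm_sq


def discriminant(psi_pos: Sequence[float], psi_neg: Sequence[float], offset: float = 0.0) -> float:
    """ASSUMED discriminant: positive similarity minus negative similarity minus an offset."""
    ideal = [1.0] * len(psi_pos)  # each hypothetical upper limit is taken to be 1
    return similarity(psi_pos, ideal) - similarity(psi_neg, ideal) - offset


def is_nitrotyrosine(psi_pos: Sequence[float], psi_neg: Sequence[float], offset: float = 0.0) -> bool:
    """Decision rule: nitrotyrosine peptide if the discriminant is > 0, otherwise not."""
    return discriminant(psi_pos, psi_neg, offset) > 0


# Toy vectors (hypothetical values, two subsites only).
TOY_SCORE = discriminant([0.5, 0.5], [0.25, 0.25])
```

Under these assumptions, raising `offset` makes the rule stricter about calling a site positive, which is one plausible way a single parameter could compensate for a negative subset much larger than the positive one.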

Metrics for Scoring Prediction Quality
In the literature, the following four metrics are often used to score the quality of a predictor from four different angles

$$\begin{cases} \mathrm{Sn} = \dfrac{TP}{TP + FN} \\[2mm] \mathrm{Sp} = \dfrac{TN}{TN + FP} \\[2mm] \mathrm{Acc} = \dfrac{TP + TN}{TP + TN + FP + FN} \\[2mm] \mathrm{MCC} = \dfrac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \end{cases} \tag{17}$$

where TP represents the number of true positives; TN, the number of true negatives; FP, the number of false positives; FN, the number of false negatives; Sn, the sensitivity; Sp, the specificity; Acc, the accuracy; MCC, the Matthews correlation coefficient. To most biologists, unfortunately, the four metrics as formulated in Eq. 17 are not quite intuitive or easy to understand, particularly the equation for MCC. Here let us adopt the formulation proposed recently in [9,11,28] based on the symbols introduced by Chou [25,55] in predicting signal peptides. According to this formulation, the same four metrics can be expressed as

$$\begin{cases} \mathrm{Sn} = 1 - \dfrac{N^{+}_{-}}{N^{+}} \\[2mm] \mathrm{Sp} = 1 - \dfrac{N^{-}_{+}}{N^{-}} \\[2mm] \mathrm{Acc} = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \\[2mm] \mathrm{MCC} = \dfrac{1 - \left( \dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}} \right)}{\sqrt{\left( 1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}} \right) \left( 1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}} \right)}} \end{cases} \tag{18}$$

where $N^{+}$ is the total number of the nitrotyrosine peptides investigated while $N^{+}_{-}$ is the number of the nitrotyrosine peptides incorrectly predicted as non-nitrotyrosine peptides; $N^{-}$ is the total number of the non-nitrotyrosine peptides investigated while $N^{-}_{+}$ is the number of the non-nitrotyrosine peptides incorrectly predicted as nitrotyrosine peptides [57]. Now, it is clear from Eq. 18 that when $N^{+}_{-} = 0$, meaning none of the nitrotyrosine peptides was incorrectly predicted to be a non-nitrotyrosine peptide, we have the sensitivity $\mathrm{Sn} = 1$. When $N^{+}_{-} = N^{+}$, meaning that all the nitrotyrosine peptides were incorrectly predicted as non-nitrotyrosine peptides, we have the sensitivity $\mathrm{Sn} = 0$. Likewise, when $N^{-}_{+} = 0$, meaning none of the non-nitrotyrosine peptides was incorrectly predicted to be a nitrotyrosine peptide, we have the specificity $\mathrm{Sp} = 1$. When $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have $\mathrm{Acc} = 0.5$ and $\mathrm{MCC} = 0$, meaning no better than random prediction. As we can see from the above discussion based on Eq. 18, the meanings of sensitivity, specificity, overall accuracy, and Matthews correlation coefficient have become much more intuitive and easier to understand.
It is instructive to point out, however, that the set of metrics in Eqs. 17-18 is valid only for single-label systems. For multi-label systems, such as those for the subcellular localization of multiplex proteins (see, e.g., [58][59][60][61][62]), where a protein may have two or more locations, and those for the functional types of antimicrobial peptides (see, e.g., [63]), where a peptide may possess two or more functional types, a completely different set of metrics is needed, as elaborated in [37].
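The equivalence of the two formulations of the four metrics can be checked numerically. The Python sketch below computes them both ways; the function and argument names are hypothetical, the counts in the example are made up for illustration (they are not results from this study), and returning MCC = 0 when the denominator vanishes is a convention assumed here rather than something specified in the text.

```python
import math
from typing import Tuple


def metrics_from_confusion(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float, float, float]:
    """Sn, Sp, Acc, MCC from the usual confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumed convention for a zero denominator
    return sn, sp, acc, mcc


def metrics_from_error_counts(n_pos: int, n_pos_missed: int,
                              n_neg: int, n_neg_missed: int) -> Tuple[float, float, float, float]:
    """Sn, Sp, Acc, MCC from totals and misclassified counts per class.

    n_pos_missed: positives wrongly predicted as negatives.
    n_neg_missed: negatives wrongly predicted as positives.
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    denom = math.sqrt((1 + (n_neg_missed - n_pos_missed) / n_pos)
                      * (1 + (n_pos_missed - n_neg_missed) / n_neg))
    numer = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    mcc = numer / denom if denom else 0.0  # assumed convention for a zero denominator
    return sn, sp, acc, mcc


# Made-up counts: 100 positives (20 missed) and 200 negatives (30 missed).
BY_ERRORS = metrics_from_error_counts(100, 20, 200, 30)
BY_CONFUSION = metrics_from_confusion(tp=80, tn=170, fp=30, fn=20)
# Half of each class misclassified: no better than random.
RANDOM_LIKE = metrics_from_error_counts(100, 50, 200, 100)
```

The two tuples agree to floating-point precision because TP, FN, TN, and FP are simply the per-class totals minus or equal to the misclassified counts.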

Jackknife Cross-Validation
With a set of clear and valid metrics as defined in Eq. 18 to measure the quality of a predictor, the next thing we need to consider is how to objectively derive the values of these metrics for a predictor.
In statistical prediction, the following three cross-validation methods are often used to calculate the metrics of Eq. 18 for evaluating the quality of a predictor: independent dataset test, subsampling test, and jackknife test [64]. However, of the three test methods, the jackknife test is deemed the least arbitrary because it can always yield a unique result for a given benchmark dataset [65]. The reasons are as follows. (i) For the independent dataset test, although all the samples used to test the predictor are outside the training dataset used to train it so as to exclude the "memory" effect or bias, the way the independent samples are selected to test the predictor could be quite arbitrary unless the number of independent samples is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than another predictor for a given independent testing dataset might fail to do so when tested by another independent testing dataset [64]. (ii) For the subsampling test, the concrete procedure usually used in the literature is the 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark dataset is an astronomical figure even for a very simple dataset, as demonstrated by Eqs. 28-30 in [8]. Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account. Since different selections will always lead to different results even for the same benchmark dataset and the same predictor, the subsampling test cannot avoid the arbitrariness either. A test method unable to yield a unique outcome cannot be deemed a good one. (iii) In the jackknife test, all the samples in the benchmark dataset will be singled out one by one and tested by the predictor trained by the remaining samples.
During the process of jackknifing, both the training dataset and testing dataset are actually open, and each sample will be in turn moved between the two. The jackknife test can exclude the "memory" effect. Also, the arbitrariness problem mentioned above for the independent dataset test and subsampling test can be avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset. Accordingly, the jackknife test has been increasingly used and widely recognized by investigators to examine the quality of various predictors (see, e.g., [33,41,43,45-47,66-72]). Therefore, in this study we also used the jackknife cross-validation method to calculate the metrics in Eq. 18, although it takes more computational time.
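The jackknife (leave-one-out) procedure described above can be sketched generically in Python as follows. The `train` and `predict` callables stand in for any predictor; the midpoint classifier used in the example is a deliberately trivial, hypothetical stand-in and has nothing to do with iNitro-Tyr itself.

```python
from typing import Any, Callable, List, Sequence, Tuple


def jackknife(samples: Sequence[Any], labels: Sequence[bool],
              train: Callable[[List[Any], List[bool]], Any],
              predict: Callable[[Any, Any], bool]) -> Tuple[int, int, int, int]:
    """Leave-one-out test: each sample is predicted by a model trained on all the others.

    Returns the counts (TP, TN, FP, FN).
    """
    tp = tn = fp = fn = 0
    for i, (sample, truth) in enumerate(zip(samples, labels)):
        rest_x = [s for k, s in enumerate(samples) if k != i]
        rest_y = [y for k, y in enumerate(labels) if k != i]
        model = train(rest_x, rest_y)  # the held-out sample never enters training
        guess = predict(model, sample)
        if truth and guess:
            tp += 1
        elif truth:
            fn += 1
        elif guess:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


# Toy stand-in predictor: threshold halfway between the two class means.
def train_midpoint(xs: List[float], ys: List[bool]) -> float:
    pos = [x for x, y in zip(xs, ys) if y]
    neg = [x for x, y in zip(xs, ys) if not y]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(threshold: float, x: float) -> bool:
    return x > threshold


RESULT = jackknife([1.0, 2.0, 8.0, 9.0], [False, False, True, True],
                   train_midpoint, predict_midpoint)
```

Because every sample is held out exactly once and no random partition is involved, repeating the run on the same dataset gives the same counts, which is the uniqueness property emphasized above. The cost is one training run per sample.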

Comparison with Other Methods
The jackknife test results by iNitro-Tyr on the benchmark dataset $S = S^{+} \cup S^{-}$ (cf. Supporting Information S1) for the four metrics defined in Eq. 18 are listed in Table 1, where, to facilitate comparison, the corresponding results by GPS-YNO2 [27] with different thresholds are also given.
From the table, we can see the following facts. (i) The overall accuracy by the current iNitro-Tyr predictor is Acc = 84.52%, which is higher than the overall accuracy by GPS-YNO2 regardless of what threshold is used for the latter. (ii) The Matthews correlation coefficient obtained by iNitro-Tyr is MCC = 0.4905, which is significantly higher than that by GPS-YNO2, indicating that the new predictor is more stable and less noisy. (iii) The sensitivity and specificity obtained by iNitro-Tyr are Sn = 81.76% and Sp = 85.89%, which are much more evenly distributed than those by the GPS-YNO2 predictor.
It is instructive to point out that, as shown by Eqs. 12a and 12b, the amino acid pairwise coupling effects [11] have been incorporated via the general form of PseAAC [8] to formulate the peptide samples. If, however, we just used the single amino acid position-specific occurrence frequency to formulate the peptide samples, the corresponding prediction quality would drop to Acc = 44.88% and MCC = 0.1656, clearly indicating that consideration of the amino acid pairwise coupling effects could significantly enhance the prediction quality. This is fully consistent with the reports by previous investigators [73,74], where it was observed that the prediction of protein secondary structural contents had been remarkably improved by taking into account the amino acid pairwise coupling effects.
Accordingly, compared with the best of existing predictors for identifying the nitrotyrosine sites in proteins, the new iNitro-Tyr predictor not only can yield higher or comparable accuracy, but is also much more stable and less noisy. It is anticipated that iNitro-Tyr may become a useful high throughput tool in this area, or at the very least play a complementary role to the existing predictors.

Web-Server and User Guide
For the convenience of most experimental scientists, we have established a web-server for the iNitro-Tyr predictor, with which users can easily get their desired results according to the steps below without the need to understand the mathematical equations in the method section.
Step 1. Open the web server at http://app.aporc.org/iNitro-Tyr/ and you will see the top page of the predictor on your computer screen, as shown in Fig. 6. Click on the Read Me button to see a brief introduction to the iNitro-Tyr predictor and the caveats when using it.
Step 2. Either type or copy/paste the sequences of query proteins into the input box shown at the center of Fig. 6. All the input sequences should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with the symbol ">" in the first column, followed by lines of sequence data in which amino acids are represented using single-letter codes. Except for the mandatory symbol ">", all the other characters in the single initial line are optional and only used for the purpose of identification and description. The sequence ends if another line starting with the symbol ">" appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the open box. Note that your input protein sequences should be formed by the 20 native amino acid codes (ACDEFGHIKLMNPQRSTVWY).
Step 3. Click on the Submit button to see the predicted results. For example, if you use the two query protein sequences in the Example window as the input, after clicking the Submit button, you will see the following on your screen. (i) The 1st protein (P05181) contains 18 Y residues, of which only those located at sequence positions 71, 318, 349, 381, and 423 are nitrotyrosine sites, while all the others are non-nitrotyrosine sites. (ii) The 2nd protein (P03023) contains 8 Y residues, of which only those located at sequence positions 7, 12, 17, and 47 are nitrotyrosine sites, while all the others are non-nitrotyrosine sites. All these results are fully consistent with experimental observations except for one Y residue at position 349 in the 1st protein (P05181), which is actually a non-nitrotyrosine site but was overpredicted as a nitrotyrosine site.
Step 4. As shown on the lower panel of Fig. 6, you may also submit your query proteins in an input file (with FASTA format) via the ''Browse'' button. To see the sample of input file, click on the Example button right under the input box.
Step 5. Click on the Data button to download the benchmark dataset used to train and test the iNitro-Tyr predictor.
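Before submitting sequences, it may be convenient to check locally that an input follows the FASTA layout described in Step 2 and uses only the 20 native amino acid codes. The Python sketch below is one way to do so; it is a hypothetical pre-check written for illustration, not part of the web-server, and the example records are made up.

```python
from typing import List, Tuple

NATIVE_CODES = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 native amino acid codes


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, in input order."""
    records: List[Tuple[str, List[str]]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            records.append((line[1:].strip(), []))  # a new record starts here
        elif not records:
            raise ValueError("sequence data found before the first '>' line")
        else:
            records[-1][1].append(line.upper())
    return [(header, "".join(chunks)) for header, chunks in records]


def invalid_residues(sequence: str) -> List[str]:
    """Return the sorted characters in `sequence` that are not native amino acid codes."""
    return sorted(set(sequence) - NATIVE_CODES)


# Made-up input with two records; the second contains a non-native code (B).
EXAMPLE = ">demo1 first record\nMKY\nAY\n\n>demo2\nACDB\n"
RECORDS = parse_fasta(EXAMPLE)
PROBLEMS = {header: invalid_residues(seq) for header, seq in RECORDS}
```

Any record with a non-empty list in `PROBLEMS` would need to be corrected before being pasted into the input box or uploaded as a file.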

Conclusions
As one of the important post-translational modifications (PTMs), nitrotyrosine is a product occurring in proteins when a tyrosine (Tyr or Y) residue is nitrated. Since a remarkably increased level of nitrotyrosine is detected in patients suffering from rheumatoid arthritis, septic shock, and coeliac disease, knowledge of nitrotyrosine is very useful for developing drugs against these diseases.
A new predictor was developed for identifying the nitrotyrosine sites in proteins based on a set of 19-tuple peptides generated as follows. A window of 19 amino acids was slid along each of the 546 protein sequences taken from a protein database, and only those peptide segments with Y (tyrosine) at the center, i.e., the potential nitrotyrosine-site-containing peptides, were collected. The benchmark dataset thus obtained contains 1,044 experimentally confirmed nitrotyrosine peptides and 7,669 non-nitrotyrosine peptides.

Table 1. Comparison of the new iNitro-Tyr predictor with the existing predictors in identifying the nitrotyrosine sites; the rates listed were derived by the jackknife cross-validation on the 546 source proteins used in [27].
The new predictor is called iNitro-Tyr, in which each of the potential nitrotyrosine-site-containing peptides was formulated with an 18-D vector formed by incorporating the position-specific dipeptide propensity (PSDP) into the general form [8] of pseudo amino acid composition [38,75] or Chou's PseAAC [39,51,54].
It has been observed by the rigorous cross validations that the iNitro-Tyr not only yields higher success rates but also is more stable and less noisy as reflected by a set of four metrics generally used to measure the quality of a predictor from different angles.
For the convenience of most experimental scientists, the web-server of iNitro-Tyr has been established at http://app.aporc.org/iNitro-Tyr/. Furthermore, to maximize their convenience, a step-by-step guide has been provided, by which users can easily get their desired results without the need to follow the complicated mathematics that were presented in this paper just for the integrity of the predictor.
It has not escaped our notice that the current approach can also be used to develop various effective methods for identifying other PTM sites in proteins.

Supporting Information
Supporting Information S1. The benchmark dataset used in this study contains 8,713 peptides formed by 19 amino acid residues with Y (tyrosine) at the center. Of these peptides, 1,044 are nitrotyrosine peptides and 7,669 are non-nitrotyrosine peptides. Also listed are the codes of the source proteins from which these 19-tuple peptide sequences are derived, as well as their corresponding sites in proteins. See the main text for further explanation. (DOC)

Figure 6. A semi-screenshot to show the top page of the iNitro-Tyr server. Its website address is http://app.aporc.org/iNitro-Tyr/. doi:10.1371/journal.pone.0105018.g006