PalmPred: An SVM Based Palmitoylation Prediction Method Using Sequence Profile Information

Protein palmitoylation is the covalent attachment of the 16-carbon fatty acid palmitate to a cysteine residue. It is the most common acylation of protein and occurs only in eukaryotes. Palmitoylation plays an important role in the regulation of protein subcellular localization, stability, translocation to lipid rafts and many other protein functions. Hence, the accurate prediction of palmitoylation site(s) can help in understanding the molecular mechanism of palmitoylation and also in designing various related experiments. Here we present a novel in silico predictor called ‘PalmPred’ to identify palmitoylation sites from protein sequence information using a support vector machine model. The best performance of PalmPred was obtained by incorporating sequence conservation features of peptide of window size 11 using a leave-one-out approach. It helped in achieving an accuracy of 91.98%, sensitivity of 79.23%, specificity of 94.30%, and Matthews Correlation Coefficient of 0.71. PalmPred outperformed existing palmitoylation site prediction methods – IFS-Palm and WAP-Palm on an independent dataset. Based on these measures it can be anticipated that PalmPred will be helpful in identifying candidate palmitoylation sites. All the source datasets, standalone and web-server are available at http://14.139.227.92/mkumar/palmpred/.


Introduction
S-Palmitoylation (hereafter termed as palmitoylation) is a eukaryote specific [1], reversible post-translational protein modification, which covalently adds palmitate moiety (C16:0) to a cysteine residue through a thioester linkage [2,3]. It plays an important role in a number of cellular processes such as membrane-protein interaction [4], signal transduction [5], neuronal development [6], apoptosis [7], lipid raft targeting [8,9] and subcellular localization [10]. Thus accurate identification of palmitoylation sites may provide important clues to decipher the underlying mechanism in the above-mentioned processes. Experimental techniques employing proteomics and imaging methods can be used for detection of palmitoylation sites. However time and resources required to search palmitoylation sites in the huge number of protein sequences present in different databanks, limit their usage. Due to this reason, only a small number of palmitoylation sites have been identified experimentally to date. Therefore an effective and highly accurate in silico prediction method can be very useful in rapid identification of candidate palmitoylation site which can be targeted for further experimental verification.
In recent years a few computational methods have been reported to find out palmitoylation sites by using information carried in protein sequences. Zhou et al. [11] developed the first predictor CSS-Palm by adopting clustering and scoring strategy on the dataset containing 210 palmitoylation sites with Jack-Knife sensitivity of 82.16% and specificity of 83.17%. Another predictor NBA-Palm was created by Xue et al. [12] using Naive Bayes method which achieved the overall prediction accuracy of 86.74% in Jack-Knife cross-validation. Ren et al. [13] proposed version 2.0 of CSS-Palm and claimed significant improvement in performance over previous version. Wang et al. [14] added a new algorithm CKSAAP-Palm to this list which used composition of k-spaced amino acid pairs as the encoding scheme. Later Hu et al. [15] proposed another predictor, named IFS-Palm, based on the features of amino acid sequences using Nearest Neighbor Algorithm and successfully showed that the IFS-Palm achieved a significantly better performance over CKSAAP-Palm on an independent dataset. Recently one more predictor WAP-Palm [16] was reported having accuracy 85.99% and Matthews Correlation Coefficient (MCC) of 0.72 in 10 fold cross-validation.
Here we report a new support vector machine (SVM) based approach for palmitoylation site identification by using features extracted from the primary amino acid sequence information only. In order to build SVM model we extracted palmitoylated peptides of different window size and encoded the same with different input features namely sequence conservation (PSSM), secondary structure and disorder. The best result was achieved with the sequence conservation encoding on 11-mer peptide. Benchmarking results on independent datasets confirmed that the proposed method is more efficient than the recent predictors, IFS-Palm and WAP-Palm. A web-server and standalone package, termed PalmPred is also available at http://14.139.227.92/mkumar/palmpred/, to enable high throughput annotation of new palmitoylation sites.

Data Source
In this study, we used the dataset constructed for the development of IFS-Palm [15]. It is compiled from the Uniprot database [17] (Release: 15.9, 13-Oct-2009) by searching the keywords ''Field'' for 'Sequence annotation [FT]', ''Topic'' for 'Lipidation', ''Term'' for 'Palmitoyl cysteine', and ''Confidence'' for 'Experimental'. The dataset consists of 151 proteins, which include 1537 cysteine residues in total, of which 234 residues were experimentally verified, as palmitoylation sites and remaining 1303 were not palmitoylated. The dataset was further divided into training and independent test datasets, similar to the strategy adopted in IFS-Palm.
Training dataset. Out of the total of 151 proteins, 132 proteins having 207 experimentally verified palmitoylated cysteines and 1140 non-palmitoylated cysteines were used as training dataset (D train ).
Independent test datasets. Remaining 19 proteins having 27 experimentally verified palmitoylated cysteines and 163 nonpalmitoylated cysteines were used as an independent dataset (D1 ind ).
It was clear that proteins of D1 ind were not present in training dataset of IFS-Palm and our method but for other predictors this may not be the case. In order to benchmark the performance of our method vis-à-vis other, we created another independent dataset (D2 ind ) . For this, we used 54 yeast proteins in which palmitoylation sites were identified and described in [18]. Eight proteins, also present in training dataset D train were excluded from the D2 ind . The resulting D2 ind dataset contains 46 proteins in which palmitoylation sites have been identified experimentally. This dataset was also used for independent evaluation of our method. To include any recent addition of palmitoylation sites, proteins of D2 ind were also searched in Uniprot from Field ''Sequence annotation (FT)'', Topic ''Lipidation'' and Term ''S-palmitoyl cysteine''.
We also compiled two more datasets for assessing the performance of our method -D3 ind and D4 ind containing 10 and 17 proteins respectively in which several palmitoylation sites were experimentally confirmed. The dataset D3 ind was collected from [19]. The dataset D4 ind was taken from [20] and consists of synaptic, motor, channels, G-protein coupled receptor, focal adhesion and tight junction proteins. We did not find any Uniprot annotation for palmitoylation in D3 ind and D4 ind proteins.

Pattern Size for Feature Encoding
The first step of our work was to determine the optimal window length, W of the cysteine containing peptide which can give maximum performance for palmitoylation site prediction. In order to do this, we extracted peptide segments of different window sizes from each protein such that each W-mer peptide contained a cysteine, symmetrically flanked by (W-1)/2 residues. For terminal cysteine residue, where the flanking region had less than (W-1)/2 residues, appropriate number of dummy residue 'X' was added to complete the window.
Each peptide segment was assigned a label depending on the nature of central cysteine residue. The peptide segment having a palmitoylated central cysteine residue was labeled positive and a non-palmitoylated central cysteine residue was labeled as negative. Thus for each window we extracted a total of 207 and 27 positive labels from D train and D1 ind respectively. Similarly the number of negative labels in D train and D1 ind were 1140 and 163.

Feature Encoding
Conservation feature. This was obtained from positionspecific scoring matrix (PSSM) generated during PSI-BLAST [21] search against NR90 by three iterations of searching at e-value cut-off of 0.001 for inclusion of sequences in next iteration. The NR90 database was constructed from NR protein sequence database clustered at 90% sequence identity by using CD-HIT [22][23][24]. The PSSM contains the probability of occurrence of each type of amino acid residues at each position and hence can be considered as a measure of residue conservation at a given position. This means that evolutionary information for each amino acid is encapsulated in a vector of 20 dimensions and the size of PSSM for a protein with N residues is 20 x N. In the present work, since we were using a peptide of fixed length 'W' to encode a palmitoylation site, a corresponding sub-matrix of size W x 20 was extracted from each PSSM. In case of peptides containing 'X' (see previous section), each 'X' in PSSM was represented by '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 09.
Structural disorder feature. Disordered regions are known to be rich in binding sites and provide an important locus for diverse protein post-translational modifications such as methylation and acetylation [25]. A number of studies also reported that the incorporation of structural disorder increases the prediction accuracy [26,27]. Therefore, we also included structural disorder probability of each residue as an input feature to code the peptides. For this purpose, VSL2 predictor [28,29] was used which assigned a score between 0 and 1 to each residue. Higher value of VSL2 score (close to 1) shows lack of fixed 3-dimensional structure while lower value shows higher propensity of fixed structure. It means larger the score is, the more likely a residue lacks fixed structure. We assigned score 0 to each dummy residue 'X'.
Secondary structure feature. In their work Hu et al. [15] had reported that information of protein structure also plays an important role in the prediction of palmitoylation site. It indicates that if structural information of each amino acid can be provided into more explicit form, it may help to achieve better prediction of palmitoylation site. In the present study we provided probability of an amino acid to form each of the three secondary structures namely, helix, sheet and coil using standalone PSIPRED (Ver 3.3) [30] at default parameters. Here also NR90 was used to generate the PSSM. Similar to conservation feature, for secondary structure prediction each 'X' was given a hypothetical value of 90 0 09 to maintain uniformity with other amino acid scores.

Support Vector Machines
We employed Support Vector Machine classifiers (SVM) to predict if, for a given input feature vector, the central cysteine residue is palmitoylated or not. SVMs, designed by Vapnik [31], are computational algorithms, which can efficiently classify complex, non-linear and high-dimensional data. So, it has been used for developing a large number of bioinformatics applications [32][33][34][35][36]. SVM trains a classifier by mapping the input vectors in higher dimension space through kernel functions and separating them into two classes (represented as positive and negative labels) with the maximal margin and least error in the transformed space. The trained classifier can be used to predict in which of the two classes an unknown sample falls, with a high confidence level. In the current study, SVM model was built using SVM-light [37] which is freely available from http://svmlight.joachims.org/. We experimented with several values of cost-factor, kernel (polynomial and radial basis function kernels) and penalty parameter C on peptides of different window sizes taken from D train . The model with the best performance parameters was selected as the optimal model.

Cross-Validation
Cross-validation is a method to evaluate classifier performance. The independent dataset test, sub-sampling (k-fold cross-validation) and Jack-Knife analysis (leave-one-out) are the three popular methods for cross-validation. In k-fold cross-validation, the dataset is randomly divided into k non-overlapping sets, k-1 sets are used for training and the remaining set for testing. This process is repeated k times such that each set is used as test set once and overall performance is calculated by averaging over all test sets.
In the present study we used 'leave-one-out' cross-validation (LOOCV) which has been considered as the most objective method in comparison to other two methods [38][39][40][41][42][43]. LOOCV uses one example from dataset as testing data and the remaining as training data. In a complete cycle of LOOCV, each example is used as test. The LOOCV thus shows dynamic behavior of testing and training data where every sample is the training set to train models as well as the testing set to test model [44]. It can also exclude the memory effects that exist in the re-substitution test, and provides the unique results for a given benchmark dataset [45].

Classifier Evaluation Measures
We adopted threshold-dependent performance matrices namely Specificity (S p ), Sensitivity (S n ), Accuracy (A cc ), and Matthews Correlation Coefficient (MCC) to measure the prediction capability of our method. Sensitivity and specificity respectively are the percentage of correct predictions from positive (palmitoylated cysteines) and negative cases (non-palmitoylated cysteines). Accuracy (arithmetic mean of sensitivity and specificity) signifies the overall percentage of correctly predicted palmitoylated and nonpalmitoylated peptides. The MCC [46] is a measure of predictive capability of classifiers, which reflects both the sensitivity, and specificity of the prediction algorithm. It is considered as a more reliable measure of the quality of binary classifications and can be used for unbalanced dataset also [47,48]. The MCC value always ranges from -1 to 1. An efficient predictor will have positive correlation coefficient value. The value -1 and 0 represents opposite and random predictions respectively.
All of the above mentioned parameters can be defined as follows: The abbreviations t + , t 2 , f + and f 2 represent true positive, true negative, false positive and false negative respectively. True and false positives are the predicted palmitoylated peptides, which are in reality a palmitoylated, and non-palmitoylated peptide respectively. True and false negatives are the peptides predicted as nonpalmitoylated and are actually a non-palmitoylated and palmitoylated peptides respectively.

Performance of PSSM and Selection of Optimized Window
To get optimum pattern size, we used only the evolutionary information obtained from PSSM generated by PSI-BLAST search against NR90. The performance was analyzed for window sizes 5, 7, 9, 11, 13, 15 and 17. As shown in Figure 1, the overall performance increased steadily with increase in the window-size, attained the peak at 11 and started declining afterwards. The maximum performance, which was achieved by us for pattern size 11, was 79.23% sensitivity, 94.30% specificity and 91.98% accuracy with MCC 0.71 (detailed performance in Table 1). In rest of the work, window-size 11 and PSSM based model was considered as baseline model unless mentioned otherwise. Additional features were added to the baseline model to further improve the performance.

Integration of Structure Disorder Information in Sequence Profile
When we integrated the disorder scores of central cysteine and its flanking 5 amino acids (on each side) derived from VSL2, no change in performance was noticed. We obtained sensitivity of 79.23%, specificity of 94.30%, accuracy of 91.98% and MCC of 0.71, which is exactly same as the performance achieved using PSSM alone (Table 1). It is opposite to what observed by Hu et al. [15] that disordered region plays an important role in the cysteinepalmitoylation. In their work, Gao and Xu [49] had observed a very little difference in the mean disorder scores (as predicted by VSL2) for both S-palmitoylated and non-palmitoylated cysteine. This little difference between the disorder propensities may be the reason for not getting any improvement in the prediction accuracy.

Prediction using Information in Sequence Conservation and Secondary Structure
Computing the probability score to form each of the three secondary structures by an amino acid is also a way of providing order/disorder information. Hence we also used PSIPRED predicted secondary structure information along with PSSM as input and trained the SVM. With PSSM and secondary structure information combined together, we achieved the accuracy of 91.98% and MCC of 0.71. The corresponding values of sensitivity and specificity were 79.23% and 94.30% respectively.
Again the result did not show any improvement over baseline model. This shows that addition of secondary structure information was also not able to provide any extra information to the predictor.

Prediction using Information in Sequence Profile, Secondary Structure and Disorder
We also used a combination of both disorder and secondary structure likelihood of each residue of the peptide pattern to see the influence of both together. Contrary to our expectation we obtained no increase in accuracy of prediction. All the performance measures i.e., sensitivity, specificity, accuracy and MCC remained same as obtained with PSSM alone (Table 1).
Hence SVM model obtained with PSSM was considered the final prediction model in rest of the work and it is referred as PalmPred henceforth.

Comparison with Existing Methods
Comparison of LOOCV performance. The existing methods of palmitoylation site prediction are CSS-Palm 1.0, NBA-Palm, CSS-Palm 2.0, CKSAAP-Palm, IFS-Palm and WAP-Palm. As the training data of the available predictors, except IFS-Palm, is different from the PalmPred, direct comparison among these predictors with PalmPred might not be reasonable. As described in materials and methods PalmPred and IFS-Palm has similar training dataset, so we compared the performance during LOOCV between them only. The PalmPred reached sensitivity of 79.23%, specificity of 94.30%, accuracy of 91.98% and MCC of 0.71 whereas the IFS-Palm attained sensitivity of 68.60%, specificity of 94.65%, accuracy of 90.65% and MCC of 0.64 ( Table 2). The result shows that at comparable specificity, PalmPred achieved almost 10% higher sensitivity.
Comparison of independent dataset performance. In order to do an unbiased evaluation, it is essential to benchmark the performance on an independent dataset. We used two independent datasets namely D1 ind and D2 ind for benchmarking purpose (see materials and methods for detail).
The first dataset (D1 ind ) had a subset of 19 proteins out of total 151 proteins compiled by Hu et al. [15] for development and evaluation of IFS-Palm. The performance of CKSAAP-Palm, IFS-Palm and PalmPred was evaluated on D1 ind . As shown in Table 3, in comparison of CKSAAP-Palm, a significant difference was observed in the performance of PalmPred. When comparison was made between IFS-Palm and PalmPred, PalmPred achieved better sensitivity though the specificity was same ( Table 3). The result was consistent to the performance shown during LOOCV, where also PalmPred had achieved higher sensitivity and comparable specificity. While we were working on development of PalmPred, a new palmitoylation site prediction method, namely, WAP-Palm was published by Shi et al. [16]. As 12 out of 15 proteins constituting the independent dataset of WAP-Palm were part of PalmPred training data, we did not benchmark the performance of WAP-Palm vis-à-vis PalmPred. The dataset D2 ind was used for performance assessment of IFS-Palm, WAP-Palm and PalmPred. We took palmitoylation sites of D2 ind proteins predicted by IFS-Palm from [15]. As Shi et al. [16] had shown that WAP-Palm performed best at threshold 0.8 we used the same threshold for prediction. We observed that PalmPred identified 61 palmitoylation sites in 33 proteins. WAP-Palm predicted 21 palmitoylation sites in 15 proteins while IFS-Palm predicted 60 sites in 31 proteins (Table 4). When we made a comparison between PalmPred and IFS-Palm, it was observed that PalmPred predicted at least one palmitoylated site in 10 different proteins where IFS-Palm failed to predict even one site. When we compared the 24 experimentally verified palmitoylation sites by Roth et al. [18], the total number of sites predicted by WAP-Palm, IFS-Palm and PalmPred were 1, 3 and 11 respectively. For protein TLG2, Roth et al. [18] had estimated the palmitoylation at position 317 [15] but PalmPred predicted it at 316 (Table 4). We cross-checked the position in sequence of TLG2 (available at Uniprot) and found that cysteine was present at position 316.  When we analyzed the prediction of PalmPred vis-à-vis Uniprot annotation, we observed that PalmPred predicted 29 novel sites, failed to predict 4 sites and correctly predicted 32 sites. CSS-Palm 1.0, NBA-Palm and CSS-Palm 2.0 web-servers were not functional, so we could not compare these methods.

Database for PSSM Construction
One of the prerequisites to carry out the prediction in PalmPred is to first do the PSI-BLAST to generate input features i.e. PSSM. One major challenge in employing PSI-BLAST is that with increase in database size, PSI-BLAST search time also increases. Therefore, to speed up the PSSM generation, we used databases having less redundancy than NR90 and then evaluated the performance. For D1 ind proteins, we generated PSSM against NR80 and NR70 and checked their performance on the PalmPred model. NR80 and NR70 contained 80% and 70% redundancy reduced protein sequences respectively and were compiled from NCBI-NR protein sequences by using CD-HIT [22][23][24]. As shown in Table S1, with decrease in redundancy of NR database, the performance also decreased which was as reported by Ahmad and Sarai [50].

Comparison with Other Machine Learning Classifiers
Other than SVM, several machine learning approaches have been used to develop classifiers for predicting post-translational modification sites including palmitoylation [12,16,51]. So besides SVM, we also tested following three machine learning methods implemented in WEKA program [52]: Naïve Bayes, RBF Network and Random forest. Similar to the SVM each of these three classifiers was constructed by incorporating PSSM score on pattern size 11. Each classifier was trained and evaluated on the training dataset (D train ) using LOOCV. By comparing the prediction results of the Naïve Bayes, RBF Network and Random forest classifiers with SVM classifier (Table 5), it was found that SVM classifier achieved the highest specificity, accuracy and MCC. The performance on independent dataset D1 ind was also very poor for Naïve Bayes, RBF Network and Random forest classifiers ( Table 5). The comparison clearly shows that the SVM is an ideal choice among different machine learning methods available.

Web-Server
To make the optimized SVM model accessible to experimental biologists, we have developed PalmPred web-server and standalone package. The prediction output provides information about all cysteine containing peptides, the position and palmitoylation state of cysteines. The PalmPred web-server can take a maximum of 5 sequences at a time. For a query dataset of more than 5 sequences standalone version of PalmPred can be used. The PalmPred is freely available at http://14.139.227.92/mkumar/ palmpred/.

Performance Assessment of PalmPred
Recently two reports were published which experimentally established palmitoylation sites in a group of proteins. The first work was done by Nishimura and Linder [19] which experimentally identified palmitoylation sites in Rho GTPase proteins. The second work was reported by Oku et al. [20] on 17 candidate proteins predominantly expressed in brain. In order to further assess the reliability of PalmPred, we used the proteins of abovementioned work (referred as D3 ind and D4 ind respectively in materials and methods).
Nishimura and Linder [19] reported a novel motif, CCaX, which tandomly undergoes prenylation and palmitoylation at C-terminal. In order to prove their hypothesis they worked on a set of ten proteins. They experimentally determined palmitoylation sites for five proteins and also reported a protein, PLA2c, which is known to be palmitoylated but the site of palmitoylation present in this protein is unknown. When PalmPred was used to predict the palmitoylation site in these ten proteins, of five proteins whose palmitoylation sites were experimentally determined PalmPred could correctly determined palmitoylation sites of two of those proteins (Table 6). For PLA2c, PalmPred predicted the candidate palmitoylation site as amino acid 539 which is consistent with the observations of [19] i.e. the predicted position lies at second C of CCaX motif. Of the remaining four proteins (PRL-1, PRL-2, PDE6a and PDE6b), whose palmitoylation sites was not determined by Nishimura and Linder, in PRL-1, PalmPred correctly predicted palmitoylation site at 171, which follows the hypothesis proposed by [19] besides one additional site at position 104 (Table 6). But in PRL-2, PalmPred predicted site did not follow the CCaX motif rule. In PDE6a and PDE6b, PalmPred did not predict any palmitoylation site which might be actually the case, as canonical CaaX processing (i.e. proteolysis and carboxymethylation after prenylation of CaaX cysteine) of PDE6a and PDE6b is well documented [53].
Out of the 17 proteins tested as candidate for palmitoylation, Oku et al. [20] were able to experimentally establish the palmitoylation only for 10 sites (Table 7). PalmPred was able to correctly predict 5 sites out of them. One additional site (at position Cys-3) was also confirmed by the mutational analysis in neurochondrin which was also correctly predicted by PalmPred. Among the seven proteins whose palmitoylation couldn't be established by [20], in four proteins namely Homer 1C, Liprin-a2, Paxillin and Par3, PalmPred did not predict any palmitoylation site (Table 7). In remaining three proteins viz Kalirin7, KIF5C candidate site and palmitoylation sites were different while in one protein (Rab3A) both candidate and PalmPred predicted sites were same but no palmitoylation can be experimentally established.
One important thing we noticed with both datasets (D3 ind and D4 ind ) that despite very large number of cysteines in few proteins, PalmPred predicted palmitoylation site did not increased proportionally. Rather it shows robustness and high specificity of prediction of our method. One of the possible reasons behind slightly inferior performance of PalmPred can be due to novelty of datasets on which work of [19] and [20] is based as both tried to establish palmitoylation in a new group of proteins. Even Uniprot does not have any information of palmitoylation of these proteins. We feel that with addition of new information to the database, the performance can also be improved further.

Conclusions
In the present study we have described a novel machine learning tool called PalmPred to identify protein palmitoylation sites by using sequence conservation features. LOOCV and benchmarking results showed that PalmPred performed better than the other existing methods. Thus we hope PalmPred may serve as a useful tool to find potential palmitoylation sites in a protein. One downside of our approach is that it takes comparatively more time to generate evolutionary profile however we tried to resolve this issue up to a certain extent by evaluating the performance of PalmPred on less redundant data. The webinterface and standalone of PalmPred is available at http://14. 139.227.92/mkumar/palmpred/. The overall working schema for PalmPred is shown in Figure 2.

Author Contributions
Conceived and designed the experiments: MK. Performed the experiments: BK RK. Analyzed the data: MK BK RK. Wrote the paper: MK BK RK.