Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs

doi:10.1371/journal.pone.0129635

Fig 1.

Overview of the proposed pbPUP predictor.

The full-length sequence of a pupylated protein is first used to generate the PSSM profile by running a PSI-BLAST search against the NCBI NR90 database. The PSSM matrices corresponding to pupylation and non-pupylation sites are then extracted from the whole profile. The encoded profile-based features are used as the input to train an SVM classifier. After optimization of the SVM parameters, the best SVM model is selected based on 10-fold cross-validation performance. Finally, the pbPUP web server is implemented and made available so that users can predict potential pupylation sites in submitted proteins.
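The profile-based encoding builds on the composition of k-spaced amino acid pairs (CKSAAP) named in the title. As a rough sketch only, the plain sequence-based CKSAAP encoding of one peptide window could look like the following. The function name `cksaap`, the default `k_max=4`, and the treatment of non-standard residues are assumptions made for illustration, not values taken from the paper, and the profile-based variant, which works from the PSSM rather than the raw sequence, is not reproduced here.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(peptide, k_max=4):
    """Return the CKSAAP feature vector (400 values per gap) for one peptide.

    For each gap k in 0..k_max, count ordered residue pairs (a, b) where b
    lies k + 1 positions after a, then divide by the number of position
    pairs available at that gap. Pairs containing a non-standard character
    (for example a padding 'X') are skipped (assumption), but they still
    count towards the denominator.
    """
    features = []
    n = len(peptide)
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        total = n - k - 1  # number of (i, i + k + 1) position pairs
        for i in range(max(total, 0)):
            pair = peptide[i] + peptide[i + k + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(counts[p] / total if total > 0 else 0.0 for p in PAIRS)
    return features


# Hypothetical toy peptide: a lysine flanked by two alanines.
toy = cksaap("AKA", k_max=1)
print(len(toy))
```

With `k_max=1` the vector has 2 × 400 entries; the first 400 describe adjacent pairs and the next 400 describe pairs separated by one residue.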

Table 1.

The statistics of pupylated proteins and their pupylation sites used in this study.

Fig 2.

Performance comparison between pbCKSAAP and CKSAAP using ROC curves.

(A) Performance comparison based on 10-fold cross-validation of the training dataset; (B) Performance comparison based on the independent test dataset.
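For readers who want to reproduce this kind of comparison, the area under a ROC curve can be computed directly from classifier scores. The snippet below is a minimal sketch using the rank interpretation of the AUC (the probability that a randomly chosen positive site scores higher than a randomly chosen negative site); the function name `roc_auc` and the toy scores are hypothetical and are not data from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve for binary labels (1 = positive, 0 = negative).

    Computed as the fraction of positive/negative pairs in which the
    positive example has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores for four candidate lysines (two true sites, two not).
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of sites, which is fine for a sketch; a sort-based version would be preferable for large datasets.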

Table 2.

The prediction performance of pbPUP and other existing predictors evaluated on the independent test dataset.

Fig 3.

Sequence logo representations comparing amino acid occurrences around pupylation and putative non-pupylation sites.

Only residues that were significantly enriched or depleted (t-test, P<0.05) in the regions flanking the centred pupylation sites are shown. Panel A shows the two-sample logo of the iPUP training dataset, while panel B shows the two-sample logo of the independent test dataset. The two-sample sequence logos were prepared using the web server http://www.twosamplelogo.org/.

Fig 4.

Comparison of the selected features in pbCKSAAP and CKSAAP using the χ² feature selection method.

(A) Feature scores of pbCKSAAP and CKSAAP; (B) The numbers of selected features in pbCKSAAP and CKSAAP with the same feature selection score cutoff χ²≥3.
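To illustrate how a score cutoff such as χ²≥3 selects features, the sketch below scores each non-negative feature against the class labels and keeps those at or above the cutoff. The exact χ² formulation used in the paper is not given in this caption, so the statistic here (observed class-wise feature sums against the sums expected from class frequencies) is an assumption, as are the function names `chi2_score` and `select_features` and the toy matrix.

```python
def chi2_score(values, labels):
    """Chi-square statistic of one non-negative feature against class labels.

    Observed: sum of the feature within each class.
    Expected: overall feature sum scaled by the class frequency.
    (Assumed formulation; the paper may compute the score differently.)
    """
    n = len(values)
    total = sum(values)
    score = 0.0
    for c in sorted(set(labels)):
        observed = sum(v for v, y in zip(values, labels) if y == c)
        expected = total * sum(1 for y in labels if y == c) / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score


def select_features(matrix, labels, cutoff=3.0):
    """Return the column indices whose chi-square score is >= cutoff."""
    n_features = len(matrix[0])
    return [
        j
        for j in range(n_features)
        if chi2_score([row[j] for row in matrix], labels) >= cutoff
    ]


# Hypothetical toy data: column 0 tracks the label, column 1 does not.
X = [[3, 1], [3, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
print(select_features(X, y))
```

Applying the same cutoff to two encodings, as in panel B, then amounts to comparing the lengths of the two returned index lists.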

Fig 5.

Box plots of the average PSSM values (APV) of amino acids positioned in the upstream, center, and downstream regions of pupylation and non-pupylation sites.

Red color denotes pupylation sites, while green color denotes non-pupylation sites.

Fig 6.

Top 30 amino acid pairs selected by the χ² feature selection method.

Red color denotes pupylation sites, while blue color denotes non-pupylation sites. In the radar diagram, each residue pair is drawn with a length proportional to the composition of the corresponding pbCKSAAP feature.

Fig 7.

Violin plots illustrating the positional distributions of the top 30 amino acid pairs of the pbCKSAAP encoding on the pupylated peptides.

(A) The distributions on the pupylated peptides from the training samples; (B) The distributions on the pupylated peptides from the independent test samples. The white dots indicate the median values, the black boxes indicate the ranges between the first and third quartiles, and the outer violin-shaped outlines show the probability density. For clarity, green dashed lines indicating the position of the central lysines are also added.
