The VHSE-Based Prediction of Proteasomal Cleavage Sites

Prediction of proteasomal cleavage sites has been a focus of computational biology. Up to date, the predictive methods are mostly based on nonlinear classifiers and variables with little physicochemical meanings. In this paper, the physicochemical properties of 14 residues both upstream and downstream of a cleavage site are characterized by VHSE (principal component score vector of hydrophobic, steric, and electronic properties) descriptors. Then, the resulting VHSE descriptors are employed to construct prediction models by support vector machine (SVM). For both in vivo and in vitro datasets, the performance of VHSE-based method is comparatively better than that of the well-known PAProC, MAPPP, and NetChop methods. The results reveal that the hydrophobic property of 10 residues both upstream and downstream of the cleavage site is a dominant factor affecting in vivo and in vitro cleavage specificities, followed by residue’s electronic and steric properties. Furthermore, the difference in hydrophobic potential between residues flanking the cleavage site is proposed to favor substrate cleavages. Overall, the interpretable VHSE-based method provides a preferable way to predict proteasomal cleavage sites.


Introduction
The ubiquitin-proteasome pathway (UPP) of protein degradation plays important roles in the cytosol and nucleus of eukaryotic cells e.g. removing misfolded, mutant, and damaged proteins [1], regulating the concentrations of regulatory proteins [2,3], digesting foreign and native proteins into small peptides and then participating in the initiation of adaptive immune response [4].
In eukaryotic cells, the most common form of proteasome is known as the 26S proteasome, which is composed of a 20S core particle capped by a 19S regulatory particle at one or both ends [5]. The 20S core particle is a stack of four heptameric rings, which are assembled to form a cylindrical structure [6]. The outer two rings are made of a subunits (a 1 ,a 7 ), which provide anchor sites for the 19S regulatory particle. The inner two rings are composed of b subunits (b 1 ,b 7 ), which form proteolytic active sites in a central cavity. Three catalytic activities located in b 1 , b 2 , and b 5 subunits are identified: peptidylglutamyl-peptide hydrolytic activity (cleavage after acidic residues); trypsin-like activity (cleavage after basic residues); and chymotrypsin-like activity (cleavage after hydrophobic residues) [7]. When cells are stimulated with pro-inflammatory cytokines, the b 1 , b 2 , and b 5 catalytic subunits can be replaced by three new catalytic subunits: b 1i , b 2i , and b 5i , respectively. This new form of proteasome is called immunoproteasome, as opposed to the constitutively expressed proteasome [8].
In the process of antigen presentation, the proteasomes can degrade proteins into peptides with 8,12 residues [9]. It has been proved that in most circumstance, the cleavage by proteasomes only generates the C-terminus of antigens, and the N-terminals of antigens are mainly trimmed by the peptidases in cytosol or endoplasmic reticulum (ER) [10,11]. Up to date, predictions of proteasomal cleavage sites have attracted considerable interests in computational biology. Three publicly available methods: PAProC [12,13], MAPPP [14,15], and NetChop [16] have been developed for predictions of proteasomal cleavage sites.
PAProC is a method for predicting cleavage sites by human proteasomes as well as wild-type and mutant yeast proteasomes. The influences of amino acids at different positions are assessed by using a stochastic hill-climbing algorithm based on the experimentally in vitro verified cleavage and non-cleavage sites; MAPPP is a method that combines proteasome cleavage predictions with MHC-binding predictions. FragPredict is a part of the MAPPP package that deals with the proteasome cleavage predictions. It consists of two algorithms. The first one uses a statistical analysis of cleavage -enhancing and -inhibiting amino acid motifs to predict potential proteasomal cleavage sites. The second one is based on a kinetic model of the 20S proteasome and takes the time-dependent degradation into account. This algorithm uses the results of the first algorithm as an input, and predicts which fragments are most likely to be generated. NetChop uses an artificial neural-network model that was built upon 18-residue peptide fragments consisting of full-length MHC-I ligands (9 residues) and the most proximal 9 residues flanking the C-terminus. At present, NetChop is known as the most successful method in cleavage site predictions. There are two versions of NetChop available, i.e. 1.0 and 2.0, and the later version is trained on a dataset 3 times larger than the 1.0 version. By comparing the predictive performance of PAProC, MAPPP, and NetChop, Saxova et al. [17] suggested that the predictions can still be improved, particularly if more degradation data become available.
Nussbaum et al. [18] demonstrated that certain amino acid characteristics in the positions flanking a cleavage site guide the selection of P1 residues by three active b subunits. Yael et al. [19] suggested that each position near the cleavage site contributes independently to the cleavage signal, and their contributions may be added. In light of these two points, 2607 MHC-I ligands from AntiJen database [20] and 489 in vitro digested data from IEDB database [21], are employed to construct a sequence-based prediction method. Characterized by VHSE amino acid descriptors [22], the physicochemical features of 14 residues upstream and downstream of the cleavage sites are used to establish prediction models by support vector machine (SVM). The in vivo and in vitro SVM models are further validated by two independent datasets (231 CTL epitopes and 48 in vitro degradation data [17]), respectively. The results show that the VHSE-based method is significantly superior to the well-known PAProC, FragPredict, and NetChop methods, in the consideration of predictive power and interpretability.

MHC-I Ligand Dataset
7324 MHC-I ligands associated with 230 human MHC-I alleles are extracted from the AntiJen database [20] (Dataset S1). The source protein sequences of these ligands are queried from the SWISS-PROT database [23]. The 7324 MHC-I ligands are pretreated according to the procedure in Figure 1 and total 2607 cleavage samples are obtained. The residues from N-terminal to C-terminal are denoted as Pn … P1 | P1' … Pn' (n = 14). The symbol ''|'' represents a cleavage site and the C-terminal of each MHC-I ligand is assigned as P1 position. In brief, the sequence with a span of 614 residues from a cleavage site forms a cleavage sample.
For each cleavage sample, the middle position of the MHC-I ligand is assigned as a non-cleavage site. Thus, the sequence with a span of 614 residues from this non-cleavage site forms a noncleavage sample. After removing sequences less than 28 residues, total 2480 non-cleavage samples are obtained. Overall, total 5087 training samples comprising 2607 cleavage samples and 2480 noncleavage samples are then used for SVM modeling (Dataset S2).
In vitro Cleavage Dataset 857 in vitro cleavage products come from IEDB database [21] (Dataset S3). These peptides with 8,11 amino acid residues are mainly from human respiratory syncytial virus (RSV) and koi herpes virus (KHV). The source protein sequence of each peptide is queried from the NCBI database [24]. The pretreatment method is the same as the MHC-I ligands. Finally, total 978 in vitro training data comprising 489 cleavage samples and 489 noncleavage samples are obtained for SVM modeling (Dataset S4).

Test Datasets
Two datasets from Saxova et al. [17] are used to validate the predictive power of the in vivo and in vitro SVM models, respectively. The first dataset comprises 231 MHC-I ligands, which are either known T cell epitopes or naturally processed peptides eluted from MHC molecules (Dataset S5 and S6). The second dataset includes 48 sequences which are digested from SSX-2 [25], HIV-Nef [26], and RUI proteins [27] by the human proteasomes (Dataset S7 and S8).

VHSE Structural Description
VHSE (principal component score vector of hydrophobic, steric, and electronic properties), a set of amino acid descriptors comes from Mei et al. [22]. A total of 18 hydrophobic properties, 17 steric properties, and 15 electronic properties of 20 natural amino acids are used for constructing VHSE descriptors by principal components analysis (PCA) [22], respectively. All physicochemical properties are auto-scaled prior to PCA analysis (SPSS 10.0). For the matrices of hydrophobic, steric, and electronic properties, the first 2, 2, and 4 principal components account for 74.33, 78.68, and 77.9% variances of original property matrices, respectively. These eight principal components can be used for characterizing 20 amino acids with less information loss. The eight score vectors are so-called VHSE descriptors, in which VHSE 1 and VHSE 2 are related to hydrophobic properties, VHSE 3 and VHSE 4 to steric properties, and VHSE 5 ,VHSE 8 to electronic properties (Table 1).
In order to reduce the number of variables, only VHSE 1 , VHSE 3 , and VHSE 5 , i.e. the first principal component score of each matrix are used for structural characterizations of cleavage/ non-cleavage samples. For example, a sample with 14 residues on either side of the cleavage site (614) can now be characterized by 2863 = 84 VHSE variables.

Support Vector Machine (SVM)
As a supervised learning method for classification, SVM [28,29] was originally proposed for solving the classification problem of linearly divisible samples. The core idea of SVM is to find an optimal separating hyperplane, which maximizes the distance of either class to this hyperplane, and minimizes the risk of misclassification. For nonlinear classification problem, SVM performs a nonlinear mapping from an input space to a highdimensional feature space, and then applies linear classification techniques in this high-dimensional space. The nonlinear mapping is accomplished by a kernel function: K(x,x i ) = W(x)?W(x i ). By introducing kernel functions, SVM can effectively avoid the problems of over-fitting, dimension disaster, and local optimum. Below are some useful kernel functions: Polynomial kernel function : Radial basis kernel function RBF ð Þ: Sigmoid kernel function : According to our experience and previous researches [30][31][32], the RBF kernel is usually superior to other non-linear kernel functions. Therefore, only linear and RBF kernels are used for SVM modeling. In this paper, SVM is implemented by SVM_light program [33]. Each VHSE variable is scaled linearly to [0, 1] before SVM modeling. The optimal values of C, e and c are determined by the results of 10-fold cross-validation.

Measures of Performance
The performance of SVM models is evaluated by accuracy (Acc), sensitivity (Sen), specificity (Spe), and Matthew's correlation coefficient (MCC), the definitions of which are shown in Equation Sen~T P TPzFN |100% ð6Þ Where TP is the number of true positives; TN is the number of false positives; FP is the number of true negatives and FN is the number of false negatives. The MCC is a balanced measure which can be used even if the classes are of very different sizes [34]. The area under receiver operating characteristics curve (AUC), a global threshold-independent measure of performance, is also used for model evaluations [35].

SVM Modeling
In order to examine the influence of sequence length on model performance, training samples with a span of 66, 68, 610, 612, and 614 residues from cleavage/non-cleavage sites are used to construct SVM models, respectively. The performance of the SVM models are shown in Table 2. For both in vivo and in vitro datasets, the model performance increases with the sequence length in the range of 66,610. However, the performance begins to decrease when the sequence length is beyond 610 residues. The results imply that residues outside the range of 610 have little contributions to substrate cleavages. Meanwhile, no significant difference is observed between linear and RBF kernels. In the consideration of complexity and interpretability, the linear SVM models are selected as the optimal models for both datasets, denoted by SVM MHC-I and SVM VITRO , respectively.
The predictive power of SVM MHC-I and SVM VITRO are further validated by two independent test sets provided by Saxova et al. [17], respectively. The overall predictive accuracies for SVM MHC-I and SVM VITRO model are 73.5% and 70.5%, respectively (Table 3). It is clear to see that the predictive power of SVM MHC-I and SVM VITRO are significantly better than that of PAProC, MAPPP, NetChop 1.0 and 2.0, especially in the level of MCC. Why our models generate more reliable predictions? There are 3 main reasons. Firstly, more training samples are involved in the SVM modeling. NetChop 2.0 is trained on 1110 MHC-I ligands, whereas SVM MHC-I on 2607 MHC-I ligands. Secondly, more residues, i.e. a span of 610 residues from the cleavage site, are considered in our models. Lastly, SVM MHC-I and SVM VITRO are established by SVM technique, which has better generalization capability and extendibility than the artificial neural network adopted by NetChop. However, the most important thing is that SVM MHC-I and SVM VITRO outperform the other models in model's interpretability. Following is a detailed analysis of proteasomal cleavage specificities based on SVM MHC-I and SVM VITRO models.

In vivo Cleavage Specificities of Proteasome
From the sequence information of proteasomal degradation products, it has become clear that the nature of the proteasome target sites cannot explain the cleavage specificities alone and the sequence context adjacent to a cleavage sites also play an important role [36][37][38]. From the results of SVM modeling, it can be indicated that 610 residues upstream and downstream of a cleavage site contribute to both the in vivo and in vitro cleavage specificities. The SVM MHC-I model is trained on naturally processed MHC-I ligands, thus, it can reflect the in vivo cleavage specificities of proteasomes. Figure 2 is the plot of weight coefficients of VHSE variables involved in SVM MHC-I . For convenience, the weight coefficients of VHSE 1 , VHSE 3 , and VHSE 5 , which characterize hydrophobic, electronic, and steric properties, are shown in Figure 2A, 2B, and 2C, respectively. Overall, the hydrophobic, electronic, and steric properties of residues are closely related to the cleavage specificities, especially for P9, P8, P7, P4, P1, P3', P4', and P5' positions.
As shown in Figure 2A, VHSE 1 variable at the P1 position has the largest positive weight coefficient (10.49). That is to say, the P1 position prefers hydrophobic residues. Falk et al. [39] found that hydrophobic Leu, Ile, Val, Thr, and Ala are the most abundant residues at the C-terminal (P1) of antigenic peptides. Earlier researches also indicated that the degradation products with hydrophobic C-terminal residues can be easily transferred to ER and bind to MHC molecules [40,41]. These are consistent with our results.
Besides P1 position, the weight coefficients of VHSE 1 upstream of the cleavage site are mainly positive, such as P2, P5, P7, P8, P9 and P10. However, the weight coefficients of VHSE 1 variables downstream of the cleavage site, except for P8', are negative. Namely, there is a significant difference in the weight coefficients of VHSE 1 between positions upstream and downstream of the cleavage sites. So, it can be inferred that hydrophobic potential flanking the cleavage site is beneficial for substrate hydrolysis.
In vitro experiments showed that Leu|Lys is a strong cleavage site [38]. According to VHSE 1 values of Leu (1.36) and Lys (21.17) together with the weight coefficient for each position, it can be inferred that Leu|Lys is a favorable combination for proteasomal cleavage.
From Figure 2B, it can be seen that that the VHSE 3 variables (steric property) of P1, P5', P4 and P9' positions have more influence on cleavage specificities. For P5' and P4 positions with negative VHSE 3 weight coefficients, bulky residues are unfavorable  to substrate cleavages. Nussbaum et al. [18] also proved that a small Pro is the most preferred at the P4 position for wild-type yeast 20S proteasome. According to the weight coefficients of VHSE 5 ( Figure 2C), electronic properties of residues at P1, P5', P9, P8, and P7' exert more influence on the cleavage specificities. Nussbaum et al. [18] observed that polar residues at P5' and P3 positions are clearly favored over non-polar ones for b5 active site, which is agreement with our results.
In general, the VHSE weight coefficients of P1, P8, and P9 positions are very similar to each other. These three positions are all inclined to select hydrophobic, bulky, and electro-positive residues. Also, the VHSE weight coefficients are similar for P2', P3', and P5', which tend to select hydrophilic, small, and electronegative residues. Interestingly, the preferences of P2', P3', and P5' are directly opposite to that of P1, P8, and P9. The profiles of in vivo cleavages are summarized in Table 4.

In vitro Cleavage Specificities of Proteasome
Compared with SVM MHC-I , the SVM VITRO model based on experimental in vitro data reflects in vitro cleavage specificities of proteasomes. Due to the differences between in vivo cellular environment and in vitro cell-free system, the cleavage specificities of proteasomes should be somewhat different. For reasons of convenience, the weight coefficients of VHSE 1 , VHSE 3 , and VHSE 5 for the SVM VITRO model are shown in Figure 3A, 3B, and 3C, respectively.
As was the case with the in vivo SVM MHC-I model, P1 position exerts the most important influence on the proteasomal cleavage, as shown in Figure 3. It is clear to see that VHSE 1 (hydrophobic) at the P1 position is a dominant variable affecting proteasomal cleavage. For P7, P8, and P9 positions, the VHSE 1 variables have relatively less influence on the proteasomal cleavage in comparison with the case of SVM MHC-I . Except for P3', the weight coefficients of the VHSE 1 variables downstream of the cleavage site are similar to the case of SVM MHC-I . Taken as a whole, hydrophobic potential difference flanking the cleavage sites is also beneficial to the in vitro proteasomal cleavages.
The contribution of VHSE 3 (steric) to the proteasomal cleavages is less than that of VHSE 1 ( Figure 3B). Compared with the case of SVM MHC-I ( Figure 2B), no significant steric hindrance effect is observed for residues in the vicinity of the cleavage site, which may be caused by the absence of cell environment.
Significant difference in the weight coefficients of VHSE 5 (electronic) is observed between the case of SVM VITRO ( Figure 3C) and SVM MHC-I ( Figure 2C). Interestingly, the signs of VHSE 5 weight coefficients in SVM VITRO seem to vary in an interval of 6 residual positions ( Figure 3C) . Compared with the case of SVM MHC-I , the influence of VHSE 5 at P1 and P5' positions on the substrate cleavages decreases significantly, while the influence of P2, P3, and P2' increases.
Overall, hydrophobic and electronic properties have more impact than steric properties on selection specificities in the in vitro system.

Conclusion
Based on SVM classification technology and VHSE description method, QSAR models with excellent predictive power are established for predicting proteasomal cleavage sites. The results show that hydrophobic property of residues flanking the cleavage site is a dominant factor affecting both the in vivo and in vitro cleavage specificities, followed by electronic and steric properties. The difference in hydrophobic potential between residues upstream and downstream of the cleavage sites is proposed to favor the substrate cleavages, especially for in vivo cleavages. For the in vivo SVM MHC-I model, the hydrophobic properties of the P1, P8, P9, and P5' play more important roles than that of other positions. In addition, the electronic and steric properties of P1 and P5' positions also have a great impact on the substrate cleavages. In comparison with the case of SVM MHC-I , the influence of residue's hydrophobic and steric properties on substrate cleavages seems to decrease in the case of SVM VITRO . However, the contribution of residue's electronic properties increases significantly, probably due to the solvation effect of the cell-free system.
In summary, compared to the well-known PAProC, FragPredict, and NetChop methods, the SVM MHC-I and SVM VITRO models are trained on larger datasets and have preferable predictive performance and interpretability. The studies presented in this paper would facilitate a deep understanding of the in vivo and in vitro selective cleavages as well as the cleavage mechanisms of the proteasomes.

Supporting Information
Dataset S1 The original data of MHC-I ligands. This excel workbook presents the 7324 MHC-I ligands extracted from the AntiJen database.