Identification of 14-3-3 Proteins Phosphopeptide-Binding Specificity Using an Affinity-Based Computational Approach

The 14-3-3 proteins are a highly conserved family of homodimeric and heterodimeric molecules, expressed in all eukaryotic cells. In human cells, this family consists of seven distinct but highly homologous 14-3-3 isoforms. 14-3-3σ is the only isoform directly linked to cancer in epithelial cells, which is regulated by major tumor suppressor genes. For each 14-3-3 isoform, we have 1,000 peptide motifs with experimental binding affinity values. In this paper, we present a novel method for identifying peptide motifs binding to 14-3-3σ isoform. First, we propose a sampling criteria to build a predictor for each new peptide sequence. Then, we select nine physicochemical properties of amino acids to describe each peptide motif. We also use auto-cross covariance to extract correlative properties of amino acids in any two positions. Finally, we consider elastic net to predict affinity values of peptide motifs, based on ridge regression and least absolute shrinkage and selection operator (LASSO). Our method tests on the 1,000 known peptide motifs binding to seven 14-3-3 isoforms. On the 14-3-3σ isoform, our method has overall pearson-product-moment correlation coefficient (PCC) and root mean squared error (RMSE) values of 0.84 and 252.31 for N–terminal sublibrary, and 0.77 and 269.13 for C–terminal sublibrary. We predict affinity values of 16,000 peptide sequences and relative binding ability across six permutated positions similar with experimental values. We identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms. Several positions on peptide motifs are in the same amino acid category with experimental substrate specificity of phosphopeptides binding to 14-3-3σ. Our method is fast and reliable and is a general computational method that can be used in peptide-protein binding identification in proteomics research.


Introduction
The 14-3-3 proteins are a highly conserved family of homodimeric and heterodimeric molecules, expressed in all eukaryotic cells [1]. As a key regulator of signal transduction, 14-3-3 isoforms participate in important cellular events including regulation of apoptosis, adhesiondependent integrin signaling, cell cycle control, DNA damage, metabolism and transcriptional regulation [2]. We have been particularly interested in understanding roles of different 14-3-3 isoforms in cell proliferation, cell cycle control, and human tumorigenesis.
In human cells, this family of proteins consists of seven distinct but highly homologous 14-3-3 isoforms: β, , η, γ, σ, τ, z [3]. Phosphate can bind to all of the 14-3-3 family and therefore being present at high intracellular concentration [4,5]. With roles of different 14-3-3 isoforms in a wide variety of signal transduction processes, 14-3-3σ is the only isoform directly linked to cancer in epithelial cells, which is regulated by major tumor suppressor genes [6][7][8]. The stabilizing ring-ring and salt bridge interactions unique to the 14-3-3σ homodimer structure are revealed by the x-ray crystal structure of 14-3-3σ with binding peptide, which potentially destabilized electrostatic interactions between subunits in 14-3-3σ-containing heterodimers, and rationalized preferential homodimerization of 14-3-3σ in vivo. The interaction of the phosphopeptide with 14-3-3 reveals a conserved mechanism for phospho-dependent ligand binding, implying that the phosphopeptide binding cleft is not the critical determinant of the unique biological properties of 14-3-3σ.
There exist many approaches identify substrate specificity of phosphopeptides that preferentially bind to 14-3-3σ over other isoforms. A major advance in understanding 14-3-3 phosphopeptide binding specificity was the recognition by Yaffe et al. [4] Using phosphoserineoriented peptide libraries, they identified a consensus hexapeptide binding motif, RXXpSXP, binding to all known 14-3-3 isoforms. The basic residue X means any of 20 amino acid types. Erik et al. [9] solved the x-ray crystal structure of 14-3-3σ, which provided structure information and demonstrated that 14-3-3σ preferentially form homodimers in cell. Unlike other six isoforms, they identified a second ligand binding sites involved in 14-3-3σ-specific ligand discrimination. In order to identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms, Lu et al. [10] used fragment-based combinatorial peptide microarray platform, dividing whole library into N-terminal and C-terminal sublibraries P −3 P −2 P −1 − p(S/T) − P +1 P +2 P +3 . The (+/−) represents relative position of p(S/T), and P +/− represents ten or five individual amino acids in each position. Ten different amino acid building blocks (R, E, F, L, Q, A, G, V, K, P) for P +/−1 P +/−2 and a total of five different amino acid building blocks (R, E, F, L, P) for P +/−3 positions were used. The phosphopeptide library was synthesized to get 14-3-3σ-specific binding peptide. They confirmed the previous consensus binding motif by Yaffe, and finally identified two 14-3-3σ-specific binders. However, their experimental methods are expensive and time consuming. Sequence variation at other positions near the phosphorylated site can cause differences in binding affinities, thus we can use the physical-chemical information to construct a computational model to extrapolate 14-3-3σ-specific binders from experimental data.
Roughly speaking, three categories of computational methods for detecting protein interactions exist. They are based on the evolution of information, natural language processing, the feature of the amino acid sequence and three-dimensional structural information. First, the evolution information [11] is extracted from multiple sequence alignment of homologous proteins. Family tree similarities are quantify tree similarities implemented a simple linear correlation between distance matrices of two protein families, as a proxy of their phylogenetic trees [12][13][14][15]. However, their computational tasks are huge. Second, methods based on Natural Language Processing (NLP) [16] can find the evidence for protein interactions from relevant scientific literatures. The problem is some binding information can not entirely appear in the literature in time. Using the hidden internal structure buried into noisy amino acid sequences [17][18][19] and some machine learning algorithms, some researchers propose prediction methods only using protein sequence information. Using three-dimensional structural information, Zhang et al. [20] predicted protein interaction with a considerable accuracy and coverage that are superior to predictions based non-structural evidence. Base on pairwise similarity method and primary structure of protein, Zaki et al. [21] measured similarity between protein sequences to predict protein binding residues. Since 14-3-3 phosphopeptide binders only have six meaningful positions in binding motif sequences, the state-of-the-art methods must be not suitable for this issue, how to dig the useful and important features is the first challenge.
In this paper, we propose the first computational method to identify and analysis 14-3-3 phosphopeptide binding specificity. We present a novel method for identifying peptide motifs binding to 14-3-3 isoforms. First, we propose a sampling criteria to build a predictor for each new peptide motif. Then, we select nine physicochemical properties of amino acids to describe each peptide motif. We also use auto cross covariance [22,23] to extract correlative properties of amino acids in any two positions. Finally, we consider elastic net [24] to predict affinity values of peptide motifs, based on ridge regression and least absolute shrinkage and selection operator (LASSO). Our method verifies 1,000 known peptide motifs binding to seven distinct but highly homologous 14-3-3 isoforms. On 14-3-3σ isoform, our method has overall pearsonproduct-moment correlation coefficient (PCC) and root mean squared error (RMSE) values of 0.84 and 252.31 for N-terminal sublibrary, and 0.77 and 269.13 for C-terminal sublibrary. It demonstrates the rationality of our computational method. Our method tests on 16,000 peptide sequences to predict binding affinity values, and relative binding ability across six permutated positions similar with the experimental value. We identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms. Several positions on peptide motifs are in the same amino acid category with experimental substrate specificity of phosphopeptides binding to 14-3-3σ.

Materials and Methods
We present an affinity-based computational approach for identifying peptide motifs binding to 14-3-3 isoforms, and this novel method is also the first computational method of 14-3-3 proteins phosphopeptide-binding specificity identification. For each 14-3-3 isoform, we have 1,000 peptide motifs with experimental binding affinity values, treated as known in this study. We need to identify affinity values of 16,000 peptide sequences binding to seven 14-3-3 isoforms. First, we propose a sampling criteria to build a predictor for each new peptide motif. Then, we select nine physicochemical properties of amino acids to describe each peptide motif. We also use auto cross covariance to extract correlative properties of amino acids in any two positions. Finally, we consider elastic net to predict affinity values of peptide motifs, based on ridge regression and least absolute shrinkage and selection operator (LASSO). The method flow is shown in Fig 1.

Data Set
Lu [10] proposed a fragment-based combinatorial peptide microarray, which enables sufficient coverage of all (P −3 P −2 P −1 − p(S/T) − P +1 P +2 P +3 ) sequences with only 1,000 peptide motifs (500 N-terminal and C-terminal sublibraries). These peptide motifs are formed as a phosphopeptide library. In a predefined manner, they use a total of ten different amino acid building blocks (R, E, F, L, Q, A, G, V, K, P) for P +/−1 and P +/−2 positions, and a total of five different amino acid building blocks (R, E, F, L, P) for P +/−3 position.
With respect to each N-terminal and C-terminal, there are 5 × 10 × 10 possibilities. For each 14-3-3 isoform, we have 1,000 peptide motifs with experimental binding affinity values. In order to study 14-3-3 proteins phosphopeptide-binding specificity from a global search space, which means there are 20 × 20 × 20 possibilities in each N-terminal and C-terminal. We will identify affinity values of 16,000 peptide sequences binding for seven 14-3-3 isoforms. To maximize the number of peptide motifs, twenty amino acids, instead of ten and five, are used at P +/−1 , P +/−2 and P +/−3 positions.

Sampling Criteria
We propose a sampling criteria to build a predictor for each new peptide motif. If all 500 peptide motifs for one terminal are used to construct a regression model, the predictor would be confused due to importing many irrelevant peptide sequences. For each new peptide sequence, we only select relevant peptide motifs to construct a dynamic regression model, which can improve average precision of the predictor.
All amino acids can be divided into five categories [25]: amino acids with positive charged side chains, amino acids with negative charged side chains, amino acids with polar uncharged side chains, amino acids with hydrophobic side chains and special cases. The details are shown in Table 1. For each new peptide sequence, we select the relevant peptide motifs with at least one P 1/2/3 position in the same category.

Feature Extraction
Based on relevant peptide motifs, we extract a set of features from the peptide sequences. There are two kinds of features in this study: one extracts nine physicochemical properties for each position and this produces 27 features; the other extracts correlation of amino acids in any two positions by auto-cross covariance, nine features for every two positions, thus leads to another 27 features [26].
We select nine physicochemical properties of all 20 amino acid types to describe each peptide motif: hydrophobicity, hydrophicility, volumes of side chains, polarity, polarizability, solvent-accessible surface area (SASA), net charge index (NCI) of side chains, mass, and hydrogen bond. Details are shown in Table 2 [26]. These nine physicochemical properties are normalized to zero mean and unit standard deviation [22,26], and the first kind of 27 features can be extracted by these normalized properties as follows: where P j represents the mean of the j-th property, P i,j is the j-th property of the i-th amino acid, S j is the corresponding unit standard deviation. We also use auto-cross covariance to extract correlation of amino acids in any two positions. Auto-cross covariance (ACC) can get two kinds of variables, auto cross (AC) between the same descriptor, and cross covariance (CC) between two different descriptors. In this study, we only use AC variables in order to avoid generating too large number of variants. We modify the AC variables to get correlation of amino acids in any two positions as follows: where m, n are different position of a peptide and j is the j-th property of residues, X i,j is the jth property of residue on the i-th position.

Linear Regression
After feature extraction described above, a suitable regression model should be selected to built an accurate predictor. Linear regression is one of the most widely used regression model in mathematical statistics, which has very good interpretability [27]. It not only gets a series of regression coefficient, but also explains how important one variable is, thus is very important in this study. We consider naive linear regression model to built an accurate predictor. Given feature vectors X 1 , Á Á Á, X p describing p features on each peptide sequence, we identify its corresponding value f(X) to represent binding affinity value as follows: Different linear regression models, i.e. ridge regression and LASSO, adopt different methods to minimize the residual sum of squares (RSS). Ridge regression minimizes the RSS subject to a bound on L2-norm of coefficients as follows: where λ controls the penalty of coefficient size, and N is the number of peptide motifs. LASSO tends to truncate some coefficients exactly at zero and hence makes model interpretable [28,29]. It minimizes RSS subject to a bound on L1-norm of coefficients [28], which is the sum of absolute values of coefficients, the equation is as follows: Considering pairwise correlations between 54 variables, we use elastic net to predict affinity values of peptide motifs. Zou [24,30] proposed elastic net, a new regularization and variable selection method, which combines ridge regression and LASSO by making a trade-off in these two penalties. The elastic net calculates corresponding value of each peptide sequence as follows: where P a ðbÞ ¼ We can calculate a ten-fold cross-validation to get the optimal λ for elastic net. In order to find the most suitable α, we produce a sequence from 0 to 1 with interval of 0.1. We apply 11 values of α to get the most suitable predictor.

Results
In this section, we have done three kinds of experiments. First, our method verifies the 1,000 known peptide motifs binding to seven distinct but highly homologous 14-3-3 isoforms. Second, our method tests on 16,000 peptide sequences to predict binding affinity values. Third, we identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms.

Verification on 1,000 known peptide motifs
Our method verifies 1,000 peptide motifs binding to seven 14-3-3 isoforms. The Pearson-product-moment correlation coefficient (PCC) and the root mean squared error (RMSE) [31] are used to evaluate performance as follows: where D contains all of relevant binding motifs, e is the average binding affinity, e i denotes experimental binding affinity value of the i-th peptide sequence, p i denotes the predicted affinity value of the i-th peptide sequence. An accurate predictor will get PCC = 1, RMSE = 0. We using the 999 peptide motifs with experimental binding affinity values as training data, removing the predicted peptide sequence. When only selecting 'relevant' data for building the predictor, about 300 peptide motifs are selected as training data each time on average. Details on identifying peptide motifs binding to 14-3-3 isoforms are shown in Table 3. On the 14-3-3σ isoform, our method has overall PCC and RMSE values of 0.84 and 252.31 for N-terminal sublibrary, and 0.77 and 269.13 for C-terminal sublibrary. It yields a considerable PCC in all seven isoforms, and the results clearly highlight the effectiveness of our method. At the same time, the RMSE values vary in different isoforms, because of several extra large values of affinity and imbalance peptide distribution between diverse values in different isoforms.
For each peptide motif to be predicted, we use ten-folds cross-validation to get the most appropriate regression model. The cross-validation results over 1,000 peptides are as showed in S1 Table. Comparison to Experimental Techniques. We produce a position-specific scoring matrix [32] on the top 50 motifs identified from each N-terminal and C-terminal sublibrary against each individual 14-3-3 isoform, to reflect position specialty for each amino acid, as shown in Fig 2. The height of each letter represents weighted contribution of that amino acid to the overall peptide binding. Our method is compared with the experimental methods from Lu [10], as summarized in Table 4. Our computational results are consistent with the previous experimental works on 14-3-3 isoforms binding peptide motifs. We get relative binding ability of all seven 14-3-3 isoforms across six permutated positions, as shown in Fig 3. Each bar represents the frequency of a particular amino acid. This confirms highly homologous feature of 14-3-3 isoforms, similar with consensus binding motif RXXpSXP. It is obvious that all of the seven isoforms strongly select peptide motifs containing Arg on P −3 position and Pro on P +2 position.
Comparison to Computational Methods. In this study, we use Elastic Net as regression model, which gets a better result and costs less time, comparing to other techniques. The quantitative comparison with other techniques, such as Simple Linear Regression, Support Vector Regression with RBF kernel and Neural Network with one hidden layer, are as show in Table 5.

Prediction on 16,000 peptide sequences
We using the 1,000 peptide motifs with experimental binding affinity values as training data, and aim to predict affinity values of 16,000 motifs for each 14-3-3 isoform. Our method predicts affinity values of all 16,000 peptide sequences binding to seven 14-3-3 isoforms. Our  Table 4. 14-3-3 preferences determined with different methods on 1,000 peptide motifs.
Position Relative to p(S/T) results confirm highly conserved binding specificity amongst 14-3-3 isoforms, and uncover some new binding information. We produce a position-specific scoring matrix on the top 500 motifs identified from each N-terminal and C-terminal sublibrary against individual 14-3-3 isoforms, to reflect position specialty for each amino acid, as shown in Fig 4. We get the relative binding ability of seven 14-3-3 isoforms across six permutated positions, as shown in Fig 5. Our method is compared with the experimental methods from Yaffe [4], as summarized in Table 6. We find the relative binding ability across six permutated positions, which are similar with the experimental results. All of the seven isoforms select peptide motifs containing Arg or Lys on P −3 position; Cys and amino acids with hydrophobic side chain on P −2 position; basic residues on P −1 and P +3 positions, and amino acids with hydrophobic side chain having most of aromatic residues on P +1 position. On P +2 position, peptide motifs with Cys, Tyr, Met and Pro show strong selection; however there is just Pro in Yaffe's research, it may be because that Yaffe used all amino acids except Cys.

Specificity of 14-3-3σ binding peptide motifs
On the 1,000 known peptide motifs, we identify the top 100 peptide motifs, irrespective of Nterminal or C-terminal, binding each 14-3-3 isoform. We filter and identify consensus sequences present in all seven isoforms, giving a total of 51 unique peptide motifs, as shown in Table 7. Compared with Lu [10], 30 peptide motifs of our results are the same with experimental 46 binding sequences, which are represented by the ? label. In the same time, most of the left 21 peptides have the same type of amino acids in two positions. The precision and recall values for our method are 59% and 65%, respectively. It indicates that our computational method obtains great consistence with experiment results.
We identify four peptide motifs that have 14-3-3σ specificity, as shown in Table 8. The four peptide motifs belong to the top 100 sequences binding 14-3-3σ, but not being part of the top Identification of 14-3-3 Proteins Phosphopeptide-Binding Specificity Using a Computational Approach 100 sequences binding other 14-3-3 isoforms. Compared with two 14-3-3σ preferable binders of Lu, B1:LFGpSLLR and B2:LFGpSLVR, three motifs have residues in the same amino acid category on P −2 and P +1 positions, as shown in Table 1. On P −2 position, Ala along with Phe has Hydrophobic side chain; Phe and Leu on P +1 position have polar uncharged side chains simultaneously.
We define a similarity score between the our predicted 14-3-3σ-specific motifs and Lu's findings. If there exists the same amino acid category in one position, we can count 1. If there exists the same amino acid type, not just the same category, we can count 3. For three N-terminal motifs, the count values are 1, 3, and 4, respectively. For one C-terminal motif, the count value is 1. Then, we use a randomization experiment and iterate 1000 times, p-value for the Nterminal motifs is 0.032, and p-value for the C-terminal motif is 0.033. Consider the regular pvalue as 0.05, the prediction results of our computational method is significant.
On all 16,000 peptide motifs, we identify the top 500 peptide motifs binding each 14-3-3 isoform. We identify six peptide motifs having 14-3-3σ specificity, as shown in Table 9. Compared with two 14-3-3σ preferable binders, two motifs have residues in the same amino acid category on P −3 and P −1 positions as shown in Table 1, on P −3 position, Ile along with Leu has Hydrophobic side chain; Pro and Gly are all special amino acids on P +1 position; and all of four C-terminal motifs show strong selection of Met and Tyr on P +1 and P +2 positions. As well as Leu and Val in same position of Lu's motifs, they all have similar hydrophobic side chain.

Discussion
We present a novel method for identifying peptide motifs binding to 14-3-3 isoforms. For each 14-3-3 isoform, we have 1,000 peptide motifs with experimental binding affinity values. We identify affinity values of 16,000 peptide sequences binding to seven 14-3-3 isoforms. First, we propose a sampling criteria to build a predictor for each new peptide motif. Then, we select nine physicochemical properties of amino acids and extract correlative properties of amino  acids to describe each peptide motif. Finally, we consider elastic net to predict binding affinities of peptide motifs. Our method tests 16,000 peptide motifs binding to seven distinct but highly homologous 14-3-3 isoforms, and the relative binding ability across six permutated positions similar with the experimental value. We identify phosphopeptides that preferentially bind to 14-3-3σ over other isoforms. Most of positions on peptide motifs are in the same amino acid category with experimental substrate specificity of phosphopeptides binding to 14-3-3σ. It indicates that, regardless of how the data are analyzed, 14-3-3σ consensus binding motifs derived from our experiments are in excellent agreement with previous work. Our method is designed and implemented as a generalized method that can be used to accurately predict the binding affinity for peptide-protein interaction in proteomics research.