Conceived and designed the experiments: JJE BK. Performed the experiments: JJE BK. Analyzed the data: JJE BK. Wrote the paper: JJE BK.
The authors have declared that no competing interests exist.
Predikin is a system for making predictions about protein kinase specificity. It was declared the “best performer” in the protein kinase section of the Peptide Recognition Domain specificity prediction category of the recent DREAM4 challenge (an independent test using unpublished data). In this article we discuss some recent improvements to the Predikin web server — including a more streamlined approach to substrate-to-kinase predictions and whole-proteome predictions — and give an analysis of Predikin's performance in the DREAM4 challenge. We also evaluate these improvements using a data set of yeast kinases that have been experimentally characterised, and we discuss the usefulness of Frobenius distance in assessing the predictive power of position weight matrices.
Linear motifs — short, functional regions of proteins — play a vital role in signalling and the regulation of cellular processes
Experimental determination of kinase specificity is both expensive and time-consuming, and identification and validation of substrates can be even more laborious
Traditional computational domain recognition techniques are not well suited for identification of phosphorylation sites, and linear motifs in general, due to their short nature — typically less than 12 residues — and the probability of seeing false positives is always very high. Furthermore, the specificity of a protein kinase is determined not only by peptide specificity — the phosphorylation residue preference and composition of surrounding residues
We have previously described an algorithm, Predikin, for predicting peptide specificity of protein kinases and identifying substrates for protein kinases based on the concept of specificity-determining residues (SDRs)
PredikinDB has continued to be updated from the latest UniProtKB
To assess the ability of these new features to increase the number of protein kinases for which Predikin can make predictions, and to evaluate their effect on accuracy, a published data set of 61 protein kinases from yeast was used. For each of these kinases, a position weight matrix describing the sequence specificity surrounding the phospho-residue had been experimentally determined
To successfully build a position weight matrix, the Predikin method relies on identifying similar specificity-determining residues, and this, in turn, is reliant on the substitution matrix used. Testing has shown that the use of different substitution matrices can enable Predikin to build position weight matrices for more protein kinases (by altering what Predikin considers similar to a specificity-determining residue). To analyse the benefits of using different substitution matrices, we attempted to build position weight matrices for each of the yeast protein kinases using various BLOSUM matrices. To assess the quality of Predikin's position weight matrices we used the same evaluation method as the DREAM4 challenge: similarity to an experimentally mapped position weight matrix using the distance induced by the Frobenius norm (Frobenius distance; see
From 16 BLOSUM matrices, BLOSUM30 clearly stands out as providing the most position weight matrices (
Each bar represents the number of kinases for which a position weight matrix could be built using each of 16 BLOSUM matrices. The blue bars show the number of position weight matrices built when using a cut-off value of 1, and the red bars show the number when using a cut-off value of 0. When considering just the number of position weight matrices, BLOSUM30 is clearly superior, and this is even more apparent when using a cut-off value of 0.
We calculated the Frobenius distance for the 12 protein kinases for which a position weight matrix can be built using all of the substitution matrices. For any given kinase, the distance produced does not vary greatly as the BLOSUM matrix changes (
The Frobenius distances achieved for 12 yeast kinases with various BLOSUM matrices using a cut-off value of 1 are shown. Each line represents one kinase; altering the BLOSUM matrix does not have a significant effect on distance, as can be seen from the predominantly horizontal lines.
Together these results show that changing the substitution matrix increases the number of kinases for which Predikin can build position weight matrices, and that BLOSUM30 captures the most kinases. We have also shown that the distance to the experimentally derived position weight matrix is not adversely affected by the use of BLOSUM30. We have also found that altering the substitution matrix cut-off value affects the number of position weight matrices that can be built. BLOSUM62 contains scores ranging from −4 to 11, with higher scores indicating more likely substitutions. By default, Predikin uses a cut-off value of 1, meaning that any substitution with a positive score is allowed. Using a cut-off value of 0, which also allows substitutions that score zero, greatly increases the number of kinases for which position weight matrices can be built, without affecting the accuracy of those position weight matrices. By using a cut-off value of 0 Predikin is able to build position weight matrices for many more protein kinases (
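The effect of the cut-off value can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary holds a small excerpt of BLOSUM62 scores rather than the full matrix, and the names `BLOSUM62_EXCERPT`, `is_similar` and `allowed_pairs` are hypothetical and are not taken from the Predikin code. The rule "score greater than or equal to the cut-off" is inferred from the statement that a cut-off of 1 admits any substitution with a positive score.

```python
# Small excerpt of BLOSUM62 scores for a few residue pairs (the matrix is symmetric).
BLOSUM62_EXCERPT = {
    ("D", "E"): 2,
    ("K", "R"): 2,
    ("S", "T"): 1,
    ("A", "G"): 0,
    ("A", "T"): 0,
    ("L", "D"): -4,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a substitution score regardless of the order of the pair."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def is_similar(a, b, cutoff=1):
    """Assumed rule: identical residues, or a substitution score >= cutoff."""
    return a == b or substitution_score(a, b) >= cutoff


def allowed_pairs(cutoff):
    """List the excerpt pairs that count as similar at a given cut-off."""
    return sorted(pair for pair in BLOSUM62_EXCERPT if is_similar(*pair, cutoff=cutoff))


if __name__ == "__main__":
    print("cut-off 1:", allowed_pairs(1))
    print("cut-off 0:", allowed_pairs(0))
```

In this excerpt, lowering the cut-off from 1 to 0 additionally admits the zero-scoring pairs (A/G and A/T), which is the mechanism by which more residues in the database can be treated as similar to a given specificity-determining residue.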
We also asked whether using a cut-off value of 0 adversely affected the distances we obtained compared with using a value of 1. We calculated the distance from the experimentally derived position weight matrix for 12 kinases using cut-off values of both 1 and 0. In four cases, the smaller distance was produced with a cut-off value of 1 (Cdc5, Gcn2, Hrr25 and Ste20) and, in a further four cases, a cut-off value of 0 gave the smaller distance (Tpk1, Tpk2, Tpk3 and Ypk1). In the remaining four cases (Cla4, Ipl1, Pkh2 and Prk1) the distance was the same for both cut-off values (
The Frobenius distance is shown for 12 kinases using BLOSUM62 and a cut-off value of 1 (blue) and 0 (red). In each case it is apparent that switching from a cut-off value of 1 to 0 has little effect on the Frobenius distance.
Each boxplot shows the distribution of p-values obtained from the set of 61 yeast protein kinases from
There remained five kinases that Predikin was unable to build specificity matrices for under any circumstances: Cak1, Kin1, Psk1, Sky1 and Ypl141c. Two of these (Cak1 and Sky1) are CMGC (a family of kinases including cyclin-dependent kinases, mitogen-activated kinases, CDK-like kinases and glycogen synthase kinases) kinases and the others are calmodulin-dependent kinases (CaMK). These are the two most represented groups in the kinases (37% CaMK and 25% CMGC kinases), and there are no consistent patterns with the specificity-determining residues of the kinases; therefore, we believe that the inability of Predikin to make predictions for these kinases is simply due to a lack of kinases with similar specificity-determining residues in PredikinDB, and that this will be rectified in time as our knowledge of kinase-substrate interactions grows.
During the course of our investigations, a different method of converting a frequency matrix to a position weight matrix was devised (see
The blue circles show the Frobenius distance for yeast protein kinases achieved using the old style Predikin position weight matrices sorted into ascending order. The red squares show the corresponding distance using the new style position weight matrix. In all cases except one the new style position weight matrix produces a smaller distance than the old style as demonstrated by the green line being below the red.
The newer style matrices show a general trend to lower Frobenius distances, and hence lower p-values. As the primary purpose of Predikin is to enable predictions of phosphorylation events, we investigated whether this decrease in Frobenius distance correlates with an increase in predictive power. ROC analysis comparing the two styles of position weight matrix shows that there is almost no difference in predictive power between the two styles of position weight matrix (
The predictive power, as assessed by the area under the ROC curve, of the new-style matrices (black dashed) is virtually identical to that of the old-style matrices (red solid). This demonstrates that Frobenius distance does not necessarily provide insight into which weight matrix is best for predictive purposes.
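For readers unfamiliar with the measure, the area under the ROC curve can be computed directly as the probability that a randomly chosen true site scores higher than a randomly chosen non-site. The sketch below uses made-up scores and a hypothetical function name, `roc_auc`; it is not the evaluation code used in this study.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up scores for illustration only.
known_sites = [0.9, 0.8, 0.6, 0.4]
other_sites = [0.7, 0.5, 0.3, 0.2]
auc = roc_auc(known_sites, other_sites)

# Rescaling every score with the same increasing function leaves the ranking,
# and therefore the AUC, unchanged.
rescaled_auc = roc_auc(
    [0.5 * s + 0.1 for s in known_sites],
    [0.5 * s + 0.1 for s in other_sites],
)

if __name__ == "__main__":
    print(auc, rescaled_auc)
```

Because the area under the curve depends only on how candidate sites are ranked, two weight matrices whose values differ can still give the same area if they order the sites in the same way.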
We further investigated the usefulness of the Frobenius distance and associated p-values by testing artificial position weight matrices that show no sequence preference against the protein kinases from the DREAM4 challenge. We constructed three position weight matrices that had equal probabilities for all amino acids in all positions (values of 0.05 represent equal probability between the 20 amino acids) except for the phospho-residue position. In the phosphorylated position, the first weight matrix had probabilities of 0.05 for all amino acids, the second had probabilities of 0.5 for serine and threonine and 0 for all other amino acids, and the third had probabilities of 0.33 for serine, threonine and tyrosine and 0 for all other amino acids. The lowest Frobenius distances were obtained by assuming only that the phospho-residue is either serine or threonine; the p-values for these matrices are all lower than the ones obtained by Predikin in the DREAM4 challenge (
Kinase | M1 distance | M1 p-value | M2 distance | M2 p-value | M3 distance | M3 p-value |
MELK | 0.9492 | 2.12e-3 | 0.6716 | 1.33e-28 | 0.7859 | 2.44e-15 |
BIKE | 0.9817 | 1.75e-3 | 0.7167 | 6.64e-39 | 0.8249 | 5.39e-19 |
CaMKK2 | 0.9765 | 1.15e-3 | 0.7096 | 7.48e-25 | 0.8187 | 1.19e-14 |
M1 is a position weight matrix with 0.05 probability for all amino acids in all positions; M2 is a matrix with 0.05 probability for all amino acids in all positions except the phosphorylated residue, where P(S) = 0.5 and P(T) = 0.5; and M3 is a matrix with 0.05 probability for all amino acids in all positions except the phosphorylated residue, where P(S) = 0.33, P(T) = 0.33 and P(Y) = 0.33.
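The three artificial matrices are simple enough to write out explicitly. The following sketch assumes the 20×7 layout that Predikin uses (the width of the matrices used in the DREAM4 evaluation may differ) and uses 1/3 where the text quotes the rounded value 0.33. The names `uniform_matrix`, `with_phospho_residues` and `column_sums` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one row per amino acid
N_POSITIONS = 7                        # assumed: positions -3 to +3
PHOSPHO_COLUMN = 3                     # zero-based index of the phosphorylated position


def uniform_matrix():
    """Every amino acid has probability 1/20 = 0.05 at every position."""
    return [[1.0 / len(AMINO_ACIDS)] * N_POSITIONS for _ in AMINO_ACIDS]


def with_phospho_residues(residues):
    """Uniform matrix, except that only `residues` are allowed at the phospho-site."""
    matrix = uniform_matrix()
    for row, amino_acid in enumerate(AMINO_ACIDS):
        if amino_acid in residues:
            matrix[row][PHOSPHO_COLUMN] = 1.0 / len(residues)
        else:
            matrix[row][PHOSPHO_COLUMN] = 0.0
    return matrix


def column_sums(matrix):
    """Total probability at each position; each should be 1."""
    return [sum(row[col] for row in matrix) for col in range(N_POSITIONS)]


m1 = uniform_matrix()                 # no preference anywhere
m2 = with_phospho_residues("ST")      # serine or threonine at the phospho-site
m3 = with_phospho_residues("STY")     # serine, threonine or tyrosine

if __name__ == "__main__":
    for name, matrix in (("M1", m1), ("M2", m2), ("M3", m3)):
        print(name, [round(total, 6) for total in column_sums(matrix)])
```

None of these matrices encodes any preference for the residues flanking the phosphorylation site, which is why their low p-values are informative about the evaluation measure rather than about kinase specificity.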
It is important to remember that some protein kinases are less specific than others, and that in situations involving these kinases a position weight matrix where many of the probabilities are close to 0.05 may be entirely appropriate. To see if this was the case for the kinases in the DREAM4 challenge we produced sequence logos
The height of symbols within each stack reflects the kinase's relative preference for the corresponding amino acid at that position. (Logos were produced with WebLogo
The Predikin algorithm entered the recent DREAM4 challenge and was declared “best performer” in the protein kinase section of the Peptide Recognition Domain specificity prediction category. In the following discussion, it should be noted that the DREAM4 predictions were made before some of the new features of Predikin described above had been implemented and before the evaluations with the yeast kinases had been completed. We were, therefore, unable to take full advantage of the knowledge subsequently gained.
There were three protein kinases in the Peptide Recognition Domain specificity section of the challenge: MELK, BIKE and CaMKK2. In all three cases, the Frobenius distance produced from Predikin's position weight matrix was the lowest achieved by any of the challenge entrants. By default, Predikin used BLOSUM62 as its substitution matrix with a cut-off value of 1. For some of the kinases in the DREAM4 challenge we had to adjust these settings. We used the following: BLOSUM62 with a cut-off value of 1 for CaMKK2, BLOSUM62 with a cut-off value of 0 for MELK and BLOSUM35 with a cut-off value of 0 for BIKE.
Kinase | Submitted method distance | Submitted method p-value | New method distance | New method p-value |
MELK | 0.869 | 4.181e-08 | 0.694 | 9.541e-26 |
BIKE | 0.913 | 2.055e-08 | 0.854 | 5.844e-15 |
CaMKK2 | 0.916 | 3.457e-07 | 0.938 | 7.536e-07 |
The table shows Frobenius distances for position weight matrices built with the submitted and new method. In two of three cases there is a very significant improvement in p-value, while in the third case there is a very small increase in distance.
Predikin was the best performer in the protein kinase section of the Peptide Recognition Domain category of the recent DREAM4 challenge, meaning that it was able to predict the experimentally obtained position weight matrix more accurately than any other entrant. This was true for each kinase in the challenge.
Visualisation of the weight matrices, through sequence logos, reveals a mixture of cases where Predikin predicts the specificity reasonably well and cases where there is still room for improvement. Even though Predikin sometimes fails to predict the correct specificity, there are no superior predictors currently available, especially when the repertoire of kinases it can make predictions for is considered. Existing predictors with better reported performance than Predikin have a more restricted repertoire of protein kinases for which they can make predictions, generally because they can only make predictions for kinases with available experimental information on their specificity. Predikin is much less restricted in this regard, as it does not require any prior knowledge about the kinase's specificity. This makes Predikin an invaluable resource when the protein kinase under consideration is not one of those that has been previously characterised. It should also be noted that there is more to recognition than binding of a specific sequence motif to the kinase (i.e., peptide specificity) alone
The three reported improvements to extend the repertoire of protein kinases Predikin can handle were successful in increasing the proportion of kinases from the yeast data set that Predikin can handle from 25% to over 91%, and we have shown that while these changes do not increase the prediction accuracy of the system, they do not adversely affect it either. We developed a method of producing weight matrices that gave lower Frobenius distances, and much lower p-values, than our original method. However, testing revealed that the drop in Frobenius distance did not correspond to an increase in prediction accuracy, as assessed by the area under the ROC curve. One reason for this discrepancy is that one only needs to correctly (or near correctly) predict amino acid specificity for one site but not others to obtain a result that would score as significantly different from random. We also showed that a weight matrix with no sequence preferences can obtain very low p-values, even though such a matrix contains no information about specificity.
From sequence logos derived from the experimentally determined weight matrices it can be observed that usually a kinase only has a well-defined specificity at one or two residue positions. This means that many small changes to other positions (to bring them closer to 0.05 for all amino acids) may have a big effect on Frobenius distance, but provide little useful information regarding specificity.
While Frobenius distance and p-value may be useful in determining which of several matrices is closest to the experimental one, they do not provide a good indication of predictive power or indicate the likelihood of the matrix representing the true position weight matrix. The Frobenius distance suffers from the same problem as other statistics that reduce data to a single global measure, in that it does not give local information; that is, some local areas of the matrix may be accurate while others are not. Ultimately the best measure of accuracy depends on what the weight matrix is intended to be used for. In the case of Predikin it is to make predictions about potential phosphorylated substrates; therefore, the best measure of success is the ability of the weight matrices to identify true phosphorylation substrates. However, this requires a different type of experimental evidence with which to test the matrices: data about which kinases phosphorylate which substrates, rather than an experimentally determined weight matrix, and this is often not available.
Predikin continues to improve and is a valuable resource for researchers working with protein kinases. Predikin has outperformed other kinase specificity prediction algorithms in an independent test of unpublished data. This, combined with several major improvements to the Predikin web server (easier substrate-to-kinase predictions, proteome analysis and new techniques to increase the number of kinases Predikin can work with), makes Predikin an important part of a kinase researcher's toolbox. The performance of some of the new features has been evaluated against previously published data on yeast protein kinases. We find that these improvements dramatically increase the number of kinases that Predikin is able to make predictions for, and that the accuracy of those predictions is not adversely affected. However, we also find that the evaluation method used in DREAM4 is not necessarily the most appropriate to identify the best predictors.
Predikin predicts peptide specificity of protein kinases by building a position weight matrix and then using this matrix to score potential phosphorylation sites. For Predikin, a position weight matrix is a 20×7 matrix where each column represents one residue position in a potential substrate with the phosphorylated residue position represented by column 4 (that is, Predikin considers the −3 to +3 residue positions relative to the phosphorylated residue). Each row of the position weight matrix represents one of the twenty amino acids. Individual weights represent the likelihood of a particular amino acid occurring at the specific position in a phosphorylated substrate.
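The matrix layout described above can be illustrated with a short Python sketch. Several things here are assumptions made purely for illustration: the weights are made up (a toy kinase that favours arginine at position −3 and phosphorylates serine or threonine), the score of a site is taken to be the sum of the weights of the observed residues, and the names `make_toy_pwm`, `window`, `score_site` and `candidate_sites` are hypothetical. Predikin's actual weights and scoring function may differ.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 rows of the matrix
POSITIONS = (-3, -2, -1, 0, 1, 2, 3)   # the 7 columns; column 4 is position 0


def make_toy_pwm():
    """Build a 20x7 matrix of made-up weights, stored as one row per amino acid."""
    pwm = {aa: [0.05] * len(POSITIONS) for aa in AMINO_ACIDS}
    minus3 = POSITIONS.index(-3)
    phospho = POSITIONS.index(0)
    for aa in AMINO_ACIDS:
        pwm[aa][minus3] = 0.62 if aa == "R" else 0.02   # arginine favoured at -3
        pwm[aa][phospho] = 0.5 if aa in "ST" else 0.0   # only S or T at the site
    return pwm


def window(sequence, site):
    """Return residues -3..+3 around a zero-based site, or None near the termini."""
    if site - 3 < 0 or site + 3 >= len(sequence):
        return None
    return sequence[site - 3:site + 4]


def score_site(pwm, heptamer):
    """Assumed scoring rule: sum the weight of each observed residue."""
    return sum(pwm[aa][col] for col, aa in enumerate(heptamer))


def candidate_sites(pwm, sequence):
    """Score every S/T with a full window; return results with the best score first."""
    results = []
    for index, aa in enumerate(sequence):
        if aa in "ST":
            heptamer = window(sequence, index)
            if heptamer is not None:
                results.append((index, heptamer, score_site(pwm, heptamer)))
    return sorted(results, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    for index, heptamer, score in candidate_sites(make_toy_pwm(), "MKRAASLPEGGTAAV"):
        print(index, heptamer, round(score, 2))
```

With these toy weights, the serine preceded by arginine at position −3 ranks above the threonine that lacks it, which is the kind of ranking a position weight matrix is intended to provide.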
The core of Predikin's approach is the concept of specificity-determining residues. A specificity-determining residue is a conserved amino acid residue, located in the catalytic domain of a protein kinase, that determines what substrate residues will be preferred at a particular position. When a kinase binds to a substrate, the substrate amino acid residues at positions −3 to +3 relative to the phosphorylated residue make contact with specificity-determining residues in a binding pocket on the surface of the kinase. The nature of the specificity-determining residues determines which residues are most likely to be found around the phosphorylation site — that is, which residues “fit” best in the binding pocket. The binding pocket, therefore, makes a major contribution to the specificity of the kinase for different substrates.
Specificity-determining residues were chosen on the basis of an analysis of the crystal structures of peptide complexes of protein kinases, and the locations of key binding residues were defined in relation to structural features and conserved sequence motifs
The input to Predikin is a protein sequence in FASTA format. Predikin attempts to identify a kinase catalytic domain in the sequence by matching it to the SMART
The background frequency of residue
PredikinDB is constructed from data extracted from the UniProt and phospho.ELM databases, although data can only be extracted when a specific kinase is linked to a phosphorylated residue, and in many cases this level of information is not available. It stores information about phosphorylation events and links these to specific protein kinases. Information about the specificity-determining residues for each kinase is also contained in the database. PredikinDB is regularly updated in an automated fashion, and constitutes an important phosphorylation data resource in itself.
The old-style weight matrices were created by normalising the matrices produced by Predikin as described above so that for each position the weights summed to a total probability of 1. The new method calculates the frequency of each amino acid in the same way as for the original weight matrix (Equation 1), but does not transform this frequency into a log-odds score. Instead the following formula was applied to transform the frequency matrix into a weight matrix:
To assess the quality of Predikin's position weight matrices we used the same evaluation method as the DREAM4 challenge: similarity to an experimentally mapped position weight matrix using the distance induced by the Frobenius norm. The Frobenius norm is equal to the square root of the matrix trace of
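As an illustration, the distance induced by the Frobenius norm between two equally sized real matrices A and B is the square root of the sum of squared element-wise differences, which is the same as the square root of the trace of D·Dᵀ for D = A − B. The sketch below computes it both ways in plain Python. The function names are hypothetical, the example matrices are made up, and any additional normalisation applied by the DREAM4 evaluation code is not reproduced here.

```python
import math


def frobenius_distance(a, b):
    """Square root of the sum of squared element-wise differences."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def frobenius_distance_via_trace(a, b):
    """Same quantity, written as sqrt(trace(D * D^T)) with D = A - B."""
    d = [[x - y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]
    # The i-th diagonal entry of D * D^T is the dot product of row i with itself.
    trace = sum(sum(value * value for value in row) for row in d)
    return math.sqrt(trace)


# Made-up 2x2 example: a matrix with no preference versus a fully specific one.
no_preference = [[0.5, 0.5], [0.5, 0.5]]
fully_specific = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    print(frobenius_distance(no_preference, fully_specific))
    print(frobenius_distance_via_trace(no_preference, fully_specific))
```

Because every cell contributes equally to the sum, the measure reports a single global value and cannot distinguish a matrix that is accurate at the few informative positions from one that is merely close at the many uninformative ones.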
Predikin is available as a web-server at
We would like to thank Ben Turk and colleagues for providing experimental position weight matrices for yeast kinases ahead of publication and Gustavo Stolovitzky and Robert Prill for providing the source code used in the DREAM4 evaluation. We would also like to thank the anonymous reviewers whose comments have helped shape the final version of the manuscript.