Accurate, high-coverage assignment of in vivo protein kinases to phosphosites from in vitro phosphoproteomic specificity data

Phosphoproteomic experiments routinely observe thousands of phosphorylation sites. To understand the intracellular signaling processes that generated this data, one or more causal protein kinases must be assigned to each phosphosite. However, limited knowledge of kinase specificity typically restricts assignments to a small subset of a kinome. Starting from a statistical model of a high-throughput, in vitro kinase-substrate assay, I have developed an approach to high-coverage, multi-label kinase-substrate assignment called IV-KAPhE (“In vivo-Kinase Assignment for Phosphorylation Evidence”). Tested on human data, IV-KAPhE outperforms other methods of similar scope. Such computational methods generally predict a densely connected kinase-substrate network, with most sites targeted by multiple kinases, pointing either to unaccounted-for biochemical constraints or significant cross-talk and signaling redundancy. I show that such predictions can potentially identify biased kinase-site misannotations within families of closely related kinase isozymes and they provide a robust basis for kinase activity analysis.


Gene Ontology semantic similarity
Semantic similarity of Gene Ontology (GO) terms between a kinase and a putative substrate protein was calculated using the "GOSemSim" package for R (version 2.18.0) [10,11]. Semantic similarity was calculated using the "Wang" measure and combined using the "BMA" method. Missing values (due to lack of annotation) were imputed from the full in vivo feature table via multivariate imputation by chained equations using the "mice" package for R (40 maximum iterations; version 3.13.0) [12].

STRING functional association scores
For all STRING component scores considered ("coexpression", "experimental", "cooccurence", "fusion", and "neighborhood"), values were transformed to be between 0 and 1 by dividing by 1000. Missing values were set to the prior value of 0.041 used by the STRING developers. STRING uses Ensemble identifiers for proteins, whereas the rest of the analysis used Uniprot identifiers. In some cases, many Ensembl identifiers can map to the same Uniprot identifier. In these cases, the means of the scores assigned to all relevant identifiers were used.

Kinase specificity models Position Frequency Matrix
Sequence windows were first weighted using a position-based sequence-weighting method [13]. Given a set of n ≥ 20 substrate sequence windows, S = S 1 , S 2 , . . . , S i , . . . , S n−1 , S n , where S i = S i1 , S i2 , . . . , S i14 , S i15 and S i j represents the amino acid at position j of sequence i, a weight was assigned to each amino acid a at position j: where c j is the number of unique amino acids found at position j among the substrates in S. Phosphosites near the 5' or 3' tail of the protein may not have 7 residues adjacent to them to form a complete sequence window. At each missing tail position, a 21 st amino acid was counted. A weight was calculated for each sequence as the sum of its position-specific residue weights: Finally, each sequence's weight was normalized: A 20 × 15 counts matrix, r was constructed such that entry r a j contains the weighted count of amino acid a, ignoring missing tails, at position j across the sequences in S: To account for the fact that S necessarily represents a limited sample of possible substrate sequences, a total number of pseudocounts, B j , were distributed at position j via 20 × 15 pseudocount matrix b as described below. The final PFM, p was then calculated from the counts and pseudocounts such that the likelihood of observing amino acid a at position j is: Following Henikoff and Henikoff [14], a position-specific pseudocount strategy was adopted.
For column j the total number of pseudocounts to be distributed among the amino acids is: where T j is the number of missing tails observed in position j among the substrates in S, c j is defined as in Equation 1, and m is a tune-able parameter, which was fixed at 1 [15]. Note that positions with lower diversity of observed residues will receive fewer pseudocounts.
Pseudocounts were then distributed according to a 20 × 15 expected frequency matrix f. In the case of PFMs evaluated directly for predictive performance or used in the construction of phosphoproteome-backed PSSMs, f was constructed as a PFM (as above) on the full set of in vitro substrates from all kinases, S BG , with no pseudocounts (B j = 0). In the case of PFMs used in the construction of proteome-backed PSSMs, f was constructed such that element f a, j = F a , where F a is the proteomic frequency of residue a. Pseudocount matrix b was then calculated such that residue a at position j receives b a, j pseudocounts: Finally, for each position j, a column weightŵ j was calculated as the relative entropy (Kullback-Leibler divergence) of the residue frequencies at that position, p j , versus the background expectation, f j , rescaled to sum to 15 (the number of positions): To assign an unseen substrate, S, to a kinase with PFM p and column weightsŵ, a score is calculated as: Position-Specific Scoring Matrix 20 × 15 PSSMs were calculated as log-likelihood ratios against the background expectation f: As described above, "proteome-backed" PSSMs were calculated against a background expectation of proteomic residue frequencies, while "phosphoproteome-backed" PSSMs were calculated against a position-specific background expectation derived from all phosphosite sequences observed in the experiment.
For assigning an unseen substrate, S, to a kinase with PSSMp and column weightsŵ, a score is calculated as: For multi-label assignment, a min-max normalized PSSM score,ŝ PSSM , was calculated to take values between 0 and 1 [16]:

Multi-label Naïve Bayes
We wish to calculate the posterior probability that a phosphosite is phosphorylated by kinase K, given its surrounding sequence window S. Using Bayes' Theorem: By applying the chain rule and then the Naïve Bayes assumption of conditional independence between the positions in S, this can be simplified as follows: We model the likelihood function, L (S j |K) as a categorical distribution. It is then easy to see from Equation 11 that, with the Naïve Bayes assumption of conditional independence: By altering the values inŵ, we arrive at a weighted Naïve Bayes formulation: So, the conditional likelihood function L (S|K) is calculated as described for PFMs.
The prior probability, P(K) was empirically estimated as the proportion of all substrates observed in the experiment (S BG ) that were observed as phosphorylated by kinase K (S K ): Typically the marginal probability, P(S), is ignored in Naïve Bayes assignment as it is just a normalizing constant. However, in multi-label Naïve Bayes, normalization is necessary to compare probabilities between classes (kinases) [17]. By the law of total probability: The second term represents the likelihood of sequence S being phosphorylated by any other kinase in the experiment (K). This is computed exactly as for kinase K, using sequences not phosphorylated by K, SK, to build the PFM and calculate the prior probability: To perform multi-label assignment of phosphosite sequence S, for each kinase K i in the set of kinases used in the experiment, K, posterior probabilities were calculated as Kinase K i was assigned to the phosphosite with sequence S if P(K i |S) > P(K i |S) [17]. This is equivalent to the condition that P(K i |S) > 0.5, thus P(K i |S) need not be calculated.

Naïve Bayes+
The Naïve Bayes+ model is an extension of the multi-label Naïve Bayes model described above.
In addition to the sequence evidence S, additional Bernoulli-distributed features were added -I K,T orĪ K,T : presence or absence of a direct physical interaction between kinase K and substrate protein T ; J K,T orJ K,T : presence or absence of an indirect physical interaction between them; D orD: the phosphosite is or is not in a predicted domain on T ; and E orĒ: substrate T does or does not carry a predicted domain that is enriched among K's substrates or physical interaction partners. An indirect interaction was defined as the substrate protein T being "two hops" away from K on the protein interaction network, i.e. they share a neighbor.
Thus, for example, the Naïve Bayes formulation of the posterior probability that kinase K phosphorylates a site with sequence S, that is not in a domainD, with no direct physical interaction I K,T , an indirect interaction J K,T , and no enriched domainsĒ would be: For all four additional features, the Bernoulli parameter p was estimated by calculating the fraction of K's substrates that have the feature (or, forK, the fraction of all sites that are not substrates of K). Domains enriched within the substrate proteins or interaction partners of kinase K were identified via Fisher's Hypergeometric Test. For each unique domain predicted to be on at least one substrate protein, presence or absence of that domain among proteins were counted contingent upon the proteins being either substrates or physical interaction partners of K or not. p-values were adjusted for false discovery rate and tested at a critical value of 0.05.
Approximate logistic relationship between phosphoproteome-backed PSSM score and Naïve Bayes posterior probability As described above, the definition of the Naïve Bayes posterior probability of assigning kinase K to a phosphosite with sequence window S is: In terms of the likelihood, the phosphoproteome-backed PSSM score function s PSSM (S,p K ,ŵ K ) can be re-written as: Likelihood L (S) is the overall likelihood of observing a phosphosite with sequence window S, as calculated using the full phosphoproteome background set, S BG . Likelihood L (S|K) is the likelihood of observing a phosphosite with sequence window S among the substrates of kinase K, as calculated using a set of known substrates of kinase K, S K . Furthermore, we assume: We apply a logit transformation to the Naïve Bayes posterior probability P(K|s), using 2 as our logarithm base: logit(P(K|S)) = log 2 P(K|S) 1 − P(K|S) Given that |S K | ≪ |S BG | (Equation 38), L (S|K) ≈ L (S). Put simply, the likelihood model constructed from a set of over 14000 phosphosites will not be strongly impacted if one were first to omit a subset of a few hundred sites on average. As a result: logit(P(K|S)) ≈ log 2 L (S|K) L (S)) + log 2 P(K) P(K) logit(P(K|S)) ≈ s PSSM (S,p K ,ŵ K ) + log 2 P(K) P(K) From Equation 38, it also follows that |S K | < |SK|, so P(K) < P(K) and log 2 P(K) P(K) < 0. This imposes a dependence of the relationship on |S K |: a larger |S K | causes a lower inflection point in the inverse logistic fit, i.e. the PSSM score s PSSM (S,p K ,ŵ K ) at which P(K|S) = 0.5 and logit(P(K|S)) = 0.  Fig A: Using relative entropy as a PFM or PSSM column weight provides greater discrimination of columns than using information content. a) Sequence logo of all S/T sites in a high-confidence human phosphoproteome [2]. b) Information content at each position among S/T sites. c) Relative entropy versus proteomic residue frequencies at each position among S/T sites. Relative entropy provides greater separation between positions. d-f) As in (a-c) for Y sites. Differing theoretical and empirical score ranges in scoring matrix-based models hinders the selection of a universal score cut-off. Each vertical bar represents one kinase, colored by S/T (green) or Y kinases (orange). Light bars show theoretical score ranges and dark bars show empirical score ranges for the four sequence-only kinase specificity models. Empirical ranges were identified using the high-confidence human phosphoproteome [2]. PFM scores are shown log 10 -transformed. Note that PFM and PSSM-based models have very different ranges between kinases. The maximum theoretical range also varies, making universal cutoff definitions difficult. Also, PFMs and PSSMs do not utilize their full range for all kinases when scoring physiologically occurring phosphosites. The Naïve Bayes model exhibits none of these problems.