Computational Prediction and Experimental Verification of New MAP Kinase Docking Sites and Substrates Including Gli Transcription Factors
Figure 2
Validation of the D-learner hidden Markov model.
(A–C) The HMM accurately identifies known D-sites in full-length sequences. Each full-length sequence run through the HMM yields a Viterbi probability for every window tested. The x-axis displays the window number and the y-axis shows the log of the Viterbi probability for each window. The dashed lines mark the threshold (a Viterbi probability of 1E-23, i.e. −23 on the log scale) above which a window is considered a predicted D-site. MKK4 (A) has one peak, MKK7 (B) has three peaks, and the arbitrarily chosen full-length sequence SEMA3C (C) has zero peaks above the threshold. (D) The HMM does not score randomized sequences highly, even if they have the same composition as a high-scoring D-site. Histogram of scores assigned to 1,000 scrambled sequences with the same sequence composition as the MKK4 D-site (blue, left ordinate labels) and to the 20 training set D-site sequences (green, right ordinate labels). Sequences were binned by score, with no sequences scoring below −37 or above −14. For the MKK4 randomized set, zero sequences surpassed the −23 threshold (dashed line). For the 20,000 total randomized D-site sequences, 30 sequences (0.15%) scored above this threshold. For the training set, 16 sequences (80%) surpassed the −23 threshold. (E) The HMM scores JNK D-sites higher than D-sites selective for ERK- or p38-family MAPKs. The name, D-learner-assigned score, and sequence of all known human MKK D-sites are shown. The JNK D-sites (MKK4 and the three MKK7 D-sites) surpass the −23 threshold; the non-cognate D-sites, although they contain the core consensus basic (blue) and hydrophobic (red) residues, do not score above it.
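The two procedures summarized in this legend, the window-by-window scan of a full-length sequence (A–C) and the composition-matched scrambling control (D), can be illustrated with a minimal Python sketch. It is not the D-learner implementation: `toy_score` is a hypothetical stand-in for the HMM's log Viterbi probability, and the window length, function names, and example sequences are assumptions made only for illustration. Only the −23 threshold is taken from the legend.

```python
import random
from typing import Callable, List, Tuple

LOG_THRESHOLD = -23.0  # threshold from the legend (log of the Viterbi probability)


def toy_score(window: str) -> float:
    """Hypothetical scorer, NOT the D-learner HMM.

    Starts at -30 and adds 5 for each basic residue (K or R)
    in the first two positions of the window.
    """
    return -30.0 + 5.0 * sum(1 for aa in window[:2] if aa in "KR")


def scan_windows(seq: str, window: int,
                 score_fn: Callable[[str], float],
                 threshold: float = LOG_THRESHOLD) -> List[Tuple[int, float]]:
    """Score every window of the sequence; return (start, score) for hits above threshold."""
    hits = []
    for start in range(len(seq) - window + 1):
        score = score_fn(seq[start:start + window])
        if score > threshold:
            hits.append((start, score))
    return hits


def scrambled_fraction_above(site: str, n: int,
                             score_fn: Callable[[str], float],
                             threshold: float = LOG_THRESHOLD,
                             seed: int = 0) -> float:
    """Shuffle the residues of `site` n times (composition is preserved)
    and return the fraction of shuffles scoring above the threshold."""
    rng = random.Random(seed)
    above = 0
    for _ in range(n):
        residues = list(site)
        rng.shuffle(residues)
        if score_fn("".join(residues)) > threshold:
            above += 1
    return above / n


if __name__ == "__main__":
    # Made-up sequence; the window length of 4 is an arbitrary assumption.
    print(scan_windows("AAKKAAAA", 4, toy_score))
    print(scrambled_fraction_above("KKAAAAAA", 1000, toy_score))
```

With a real model, `toy_score` would be replaced by the HMM's scoring call, and the fraction returned by `scrambled_fraction_above` would correspond to the percentages of scrambled sequences above threshold reported in panel D.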