The Origins of Specificity in Polyketide Synthase Protein Interactions
Figure 4
Code Words and Compatibility Subclasses
In this figure, symbols associated with head domains are colored red, and those associated with tail domains are colored green. This coloring is unrelated to compatibility class, as all domains pictured belong to compatibility class H1–T1.
(A) The three most significant CRoSS pairs pick out three residues each on the head (red) and tail (green) domains. These residues (indicated by asterisks in Figure 3C) are highlighted by arrows, with their position along the domain shown in parentheses. The amino acids at these positions define our code words.
(B) Schematic representation of code word clusters. Each node of the graph represents a unique head (red) or tail (green) code word, and each edge represents a known interaction (orange) or noninteraction (purple) between code words. We use a Monte Carlo algorithm (Text S1, Section 3) to group the nodes into clusters such that interactions are enriched within clusters and noninteractions are enriched between clusters. These clusters thus represent a refinement of H1–T1 into compatibility subclasses.
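The grouping step in (B) can be illustrated with a minimal Python sketch. It is not the algorithm of Text S1, Section 3, whose energy function and move set are not given in this caption. The sketch assumes a simple energy that drops by one for every interaction inside a cluster and for every noninteraction across clusters, and it uses Metropolis-style single-node moves. The function names, the node labels, the temperature, and the step count are all hypothetical.

```python
import math
import random


def clustering_energy(assignment, interactions, noninteractions):
    """Assumed energy: -1 per interaction within a cluster,
    -1 per noninteraction between clusters (lower is better)."""
    energy = 0
    for head, tail in interactions:
        if assignment[head] == assignment[tail]:
            energy -= 1
    for head, tail in noninteractions:
        if assignment[head] != assignment[tail]:
            energy -= 1
    return energy


def monte_carlo_cluster(nodes, interactions, noninteractions, n_clusters,
                        steps=5000, temperature=0.5, seed=0):
    """Metropolis search over cluster assignments; returns the best one seen."""
    rng = random.Random(seed)
    assignment = {node: rng.randrange(n_clusters) for node in nodes}
    energy = clustering_energy(assignment, interactions, noninteractions)
    best, best_energy = dict(assignment), energy
    for _ in range(steps):
        # Propose moving one randomly chosen node to a random cluster.
        node = rng.choice(nodes)
        old_cluster = assignment[node]
        assignment[node] = rng.randrange(n_clusters)
        new_energy = clustering_energy(assignment, interactions, noninteractions)
        delta = new_energy - energy
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best, best_energy = dict(assignment), energy
        else:
            assignment[node] = old_cluster  # reject: undo the move
    return best, best_energy


# Hypothetical toy graph: two heads, two tails, two natural subclasses.
nodes = ["h1", "h2", "t1", "t2"]
interactions = [("h1", "t1"), ("h2", "t2")]
noninteractions = [("h1", "t2"), ("h2", "t1")]
best, best_energy = monte_carlo_cluster(nodes, interactions, noninteractions, 2)
```

Recomputing the full energy at every step keeps the sketch short; a real implementation would more likely update only the edges touching the moved node.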
(C) Actual code word clusters. This is a matrix representation of the interaction graph shown in Figure 4B. Each row corresponds to a head code word (red), and each column to a tail code word (green). The entries represent edges, showing that the corresponding code word pairs have been found on known interactors (orange), known noninteractors (purple), both (pink), or neither (white). Code words are grouped into four clusters (each labeled by a different shade of red or green), corresponding to subclasses of H1–T1 within which interactions are enriched. Nodes that occur as singletons are not shown.
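As an illustration of the matrix layout in (C), the following Python sketch builds such a matrix from lists of code word pairs. The function name and the example code words are hypothetical; only the four entry types and their colors come from the caption.

```python
def interaction_matrix(head_words, tail_words, interacting, noninteracting):
    """Return one row per head code word and one column per tail code word.

    Each entry is the color used in the caption: orange (seen on known
    interactors), purple (known noninteractors), pink (both), white (neither).
    """
    interacting = set(interacting)
    noninteracting = set(noninteracting)
    matrix = []
    for head in head_words:
        row = []
        for tail in tail_words:
            pair = (head, tail)
            if pair in interacting and pair in noninteracting:
                row.append("pink")
            elif pair in interacting:
                row.append("orange")
            elif pair in noninteracting:
                row.append("purple")
            else:
                row.append("white")
        matrix.append(row)
    return matrix


# Hypothetical three-residue code words, for illustration only.
heads = ["DKE", "ERA"]
tails = ["RLK", "EDQ"]
matrix = interaction_matrix(
    heads,
    tails,
    interacting=[("DKE", "RLK"), ("ERA", "EDQ"), ("DKE", "EDQ")],
    noninteracting=[("DKE", "EDQ"), ("ERA", "RLK")],
)
```

A code word pair can appear in both lists because the same pair of code words may occur on different domain pairs, some known to interact and some known not to.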
(D) Synonymous sets of code words. The code words belonging to each subclass are explicitly listed, in the same order as in the matrix of Figure 4C. Comparison with the matrix shows that, within a given subclass, each head is compatible with several tails, and vice versa. The subclasses are labeled by shade, as well as by the indices a, b, c, and d. Within each subclass, we see a high degree of code word sequence similarity. If an amino acid occurs in a majority of instances at a given position, it is included in the consensus sequence characterizing that subclass.
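The majority rule for the consensus sequence in (D) can be sketched in Python as follows. The function name, the placeholder character for positions without a majority, and the example code words are assumptions made for illustration.

```python
from collections import Counter


def consensus(code_words, placeholder="x"):
    """Majority-rule consensus of equal-length code words.

    A residue enters the consensus only if it occurs in more than half of
    the code words at that position; otherwise the placeholder is used.
    """
    if not code_words:
        return ""
    length = len(code_words[0])
    if any(len(word) != length for word in code_words):
        raise ValueError("all code words must have the same length")
    result = []
    for position in range(length):
        counts = Counter(word[position] for word in code_words)
        residue, count = counts.most_common(1)[0]
        result.append(residue if count > len(code_words) / 2 else placeholder)
    return "".join(result)


# Hypothetical code words from one subclass.
print(consensus(["DKE", "DRE", "EKE"]))  # prints "DKE"
```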
(E) Histogram of clustering energies for 50 datasets with randomized interactions (Text S1, Section 3). The more negative the energy, the better the clustering. The red line marks the energy of the true dataset, which lies far to the left of the distribution for the randomized datasets. The observed degree of clustering is therefore statistically significant (p-value < 0.02).
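One way to read the significance statement in (E) is as an empirical p-value: the fraction of randomized datasets whose clustering energy is at least as low as that of the true dataset. The Python sketch below assumes this interpretation; the function name and the energy values are hypothetical. Under this reading, if none of 50 randomized datasets reaches the true energy, the estimate can only be bounded by the resolution 1/50 = 0.02, which is consistent with the reported bound.

```python
def empirical_p_value(true_energy, randomized_energies):
    """Return (fraction, resolution) for a one-sided randomization test.

    fraction   : share of randomized datasets with energy <= true_energy
    resolution : smallest nonzero fraction the test can resolve, 1 / n
    A fraction of 0 should be reported as p < resolution, not as p = 0.
    """
    n = len(randomized_energies)
    if n == 0:
        raise ValueError("need at least one randomized dataset")
    as_extreme = sum(1 for energy in randomized_energies if energy <= true_energy)
    return as_extreme / n, 1 / n


# Hypothetical energies: 50 randomized datasets, all worse than the true one.
randomized = [-20.0 - 0.1 * i for i in range(50)]
fraction, resolution = empirical_p_value(-60.0, randomized)
```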