Figure 1.
(A) Location of docking domains at protein termini. A pair of interacting PKS proteins is represented as chevrons, with their docking domains indicated in grey: the C-terminal head domain (pointed) and the N-terminal tail domain (notched). The boxes immediately below give an overview of the 149 head and tail domains in our dataset. Each horizontal line represents a single protein sequence; the sequences are aligned according to their conserved docking domains (grey), and sorted according to the lengths of their unconserved overhangs. Most docking domains are found within a few amino acids of the protein termini.
(B) Representative multiple-sequence alignments of the 19 aa C-terminal head domains (left) and 27 aa N-terminal tail domains (right). Hydrophobic residues are colored red, hydrophilic residues are colored blue, and intensity reflects sequence conservation [29]. Note that sequences are sorted differently than in Figure 1A. Each row shows an interacting head–tail pair, labeled by the PKS pathway it belongs to, and the interface number within that pathway. For example, ampho_003 represents the interface between proteins 3 and 4 of the amphotericin PKS. (Pathway abbreviations are described in detail in Dataset S1.) The head and tail alignments are divided into three groups each (H1, H2, H3, and T1, T2, T3) corresponding to the clusters discussed in Figure 2.
Figure 2.
Docking Domain Compatibility Classes
(A) Docking domains are clustered according to sequence similarity (Text S1, Section 2). Each node represents a particular head (left) or tail (right) domain; two domains are connected by a line if their BLAST e-value is less than a defined cutoff (2.0e-4 for heads, and 1.0e-4 for tails). Head and tail domains each independently assort into three phylogenetic clusters, labeled H1, H2, H3, and T1, T2, T3, respectively. For the moment, cluster coloring is arbitrary.
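The thresholded similarity graph described in this panel can be sketched as follows. This is only an illustration: the actual clustering procedure is the one given in Text S1, Section 2, and treating clusters as connected components of the graph is an assumption made here. The domain names and e-values are hypothetical toy data; only the head cutoff of 2.0e-4 comes from the caption.

```python
def cluster_by_evalue(domains, hits, cutoff):
    """Group domains into connected components of the similarity graph.

    domains: list of domain names.
    hits: iterable of (domain_a, domain_b, e_value) tuples (hypothetical format).
    cutoff: two domains are linked when their e-value is below this value.
    """
    neighbors = {d: set() for d in domains}
    for a, b, e_value in hits:
        if e_value < cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    clusters, seen = [], set()
    for start in domains:
        if start in seen:
            continue
        # Depth-first search collects everything reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Toy data (hypothetical names and e-values).
toy_domains = ["h_a", "h_b", "h_c", "h_d"]
toy_hits = [("h_a", "h_b", 1e-6), ("h_b", "h_c", 5e-3), ("h_c", "h_d", 1e-5)]
toy_clusters = cluster_by_evalue(toy_domains, toy_hits, cutoff=2.0e-4)
print(toy_clusters)
```

In the toy data the h_b/h_c hit is weaker than the cutoff, so the four domains split into two clusters.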
(B) Examples of PKS multiprotein chains. Each row shows a different PKS pathway, with names as defined in Dataset S1. Proteins are represented as chevrons, with C-terminal head domains (pointed) and N-terminal tail domains (notched) now colored according to their phylogenetic group, as defined in Figure 2A. The pathway termini, as well as domains which could not be clustered, are colored grey. Note that interactions predominantly occur between docking domains of the same color. There are only two exceptions to this rule, one of which is shown in the nanch multiprotein chain. (The other, in the nidda multiprotein chain, involves a domain that lies at the boundary of its parent cluster, indicating that it has probably been misclassified by our clustering algorithm; see Dataset S2.) Actinobacterial pathways tend to use domain pairs of type H1–T1 and H2–T2, while myxobacterial and cyanobacterial pathways use domain pairs of type H3–T3 alone. (Again, the few exceptions to this rule, such as in the epoth multiprotein chain, are likely due to misclassification.)
(C) Phylogenetic clusters coincide with docking domain compatibility classes. Each PKS pathway in our dataset gives us a list of known interactors, as well as a list of known noninteractors. For example, the amphotericin pathway contains five internal head and tail domains; of the 25 possible pairings of these domains, five represent interactors (three H1–T1 and two H2–T2), while the remaining 20 represent noninteractors (six H1–T1, six H1–T2, six H2–T1, and two H2–T2). We tallied such interaction and noninteraction information over the 42 PKS pathways in our dataset, and summarized our results in a single table. Each row corresponds to a head cluster, and each column to a tail cluster. The top-left entry of every cell reports the number of head–tail pairs of the given variety known to be interactors; the bottom-right entry reports the number of head–tail pairs known to be noninteractors. For example, we know one interactor and 73 noninteractors of the H1–T2 variety. The correspondence between the head and tail clusters is clear: we find large numbers of interactors within compatible clusters (on-diagonal, highlighted in orange) and large numbers of noninteractors between them (off-diagonal, highlighted in purple). This defines a one-to-one pairing of head and tail clusters into three compatibility classes, H1–T1 (green), H2–T2 (red), and H3–T3 (blue), and justifies the common coloring used in Figure 2A. This division into compatibility classes is a useful predictive tool for actinobacterial pathways, since they tend to contain docking domains of multiple varieties. For example, since the ampho pathway has three H1–T1 domain pairs and two H2–T2 domain pairs, only 12 of the 120 possible ways of pairing them are compatible (2!3! out of 5!). Of the 33 actinobacterial pathways in our dataset, 19 contain both H1–T1 and H2–T2 varieties (Dataset S2). For these mixed pathways, on average less than a third of all possible ways of pairing their domains are compatible.
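The amphotericin arithmetic in this panel can be checked with a short sketch. Only the class counts (three H1–T1 and two H2–T2 interfaces) come from the caption. The order of the interfaces along the pathway and the function names are assumptions for illustration, and they do not affect the totals.

```python
from collections import Counter
from itertools import permutations

# Head i is the known partner of tail i. The ordering is hypothetical.
heads = ["H1", "H1", "H1", "H2", "H2"]
tails = ["T1", "T1", "T1", "T2", "T2"]


def tally(heads, tails):
    """Count known interactors and noninteractors for each (head, tail) class."""
    interactors, noninteractors = Counter(), Counter()
    for i, h in enumerate(heads):
        for j, t in enumerate(tails):
            if i == j:
                interactors[(h, t)] += 1
            else:
                noninteractors[(h, t)] += 1
    return interactors, noninteractors


def count_compatible_pairings(heads, tails):
    """Count one-to-one head-tail pairings in which every pair is class-compatible."""
    compatible = {"H1": "T1", "H2": "T2", "H3": "T3"}
    total = ok = 0
    for perm in permutations(range(len(tails))):
        total += 1
        if all(compatible[h] == tails[j] for h, j in zip(heads, perm)):
            ok += 1
    return ok, total


interactors, noninteractors = tally(heads, tails)
print(dict(interactors))     # {('H1', 'T1'): 3, ('H2', 'T2'): 2}
print(dict(noninteractors))  # six H1-T1, six H1-T2, six H2-T1, two H2-T2
print(count_compatible_pairings(heads, tails))  # (12, 120)
```

Enumerating all 5! permutations is practical only for small pathways; for larger ones the count is the product of the factorials of the class sizes, as in the 2!3! of the caption.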
Figure 3.
In this figure, symbols associated with head domains are colored red, and those associated with tail domains are colored green. This coloring is unrelated to compatibility class, as all domains pictured belong to compatibility class H1–T1.
(A,B) CRoSS matrices, showing the residue pairs that significantly contribute to specificity (Methods). Residues on the C-terminal head domain are indexed vertically (i = 1, ... , 19); residues on the N-terminal tail domain are indexed horizontally (j = 1, ... , 27). Each entry shows -log10(ρij) for the corresponding site-pair, with the scale indicated on the color bar; the higher this value, the more significant the site-pair as a determinant of specificity.
(A) The H1–T1 control matrix, generated by comparing random pairings with noninteractors. Since we expect no significant hits, these entries provide us with an estimate of the random background.
(B) The H1–T1 interaction matrix, generated by comparing interactors with noninteractors. Several entries are highlighted above the background (white circles serve as guides to the eye). The matrix is sparse, showing that residue pairs at a few key sites vary independently, uncorrelated to any broad phylogenetic patterns. There are only seven significant residue pairs (Figure S1C). These are, in order of significance, {i, j} = {6,5}, {13,11}, {12,18}, {12,11}, {12,12}, {12,16}, {12,5}. The three head residues and five tail residues that make up these pairs are indicated along the axes by red and green arrows, respectively.
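As a quick consistency check, the head and tail positions can be recovered from the seven pairs listed above. The pair list is copied from the caption; the variable names are illustrative.

```python
# Significant CRoSS pairs {i, j}, in order of significance (from the caption).
cross_pairs = [(6, 5), (13, 11), (12, 18), (12, 11), (12, 12), (12, 16), (12, 5)]

# Distinct head (i) and tail (j) positions that appear in any pair.
head_sites = sorted({i for i, _ in cross_pairs})
tail_sites = sorted({j for _, j in cross_pairs})

print(head_sites)  # [6, 12, 13]
print(tail_sites)  # [5, 11, 12, 16, 18]
```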
(C) Representative multiple-sequence alignments of H1 head domains (left) and T1 tail domains (right), with cartoon representations of the head and tail domains shown below. Each row shows an interacting head–tail pair, labeled by the PKS pathway it belongs to, and the interface number within that pathway. The three head residues and five tail residues selected by CRoSS are indicated by red and green arrows, respectively; the corresponding positions in the sequence alignments are highlighted in bold. The three most significant head and tail residues (used to define code words in Figure 4) are indicated by asterisks. The head and tail alignments are divided into four groups each (H1a–H1d and T1a–T1d) corresponding to the subclasses discussed in Figure 4.
Figure 4.
Code Words and Compatibility Subclasses
In this figure, symbols associated with head domains are colored red, and those associated with tail domains are colored green. This coloring is unrelated to compatibility class, as all domains pictured belong to compatibility class H1–T1.
(A) The three most significant CRoSS pairs pick out three residues each on the head (red) and tail (green) domains. These residues (indicated by asterisks in Figure 3C) are highlighted by arrows, with their position along the domain shown in parentheses. The amino acids at these positions define our code words.
(B) Schematic representation of code word clusters. Each node of the graph represents a unique head (red) or tail (green) code word, and each edge represents a known interaction (orange) or noninteraction (purple) between code words. We use a Monte Carlo algorithm (Text S1, Section 3) to group the nodes into clusters, such that interactions are enriched within a cluster, and noninteractions are enriched between clusters. These clusters thus represent a refinement of H1–T1 into compatibility subclasses.
(C) Actual code word clusters. This is a matrix representation of the interaction graph shown in Figure 4B. Each row corresponds to a head code word (red), and each column to a tail code word (green). The entries represent edges, showing that the corresponding code word pairs have been found on known interactors (orange), known noninteractors (purple), both (pink), or neither (white). Code words are grouped into four clusters (each labeled by a different shade of red or green), corresponding to subclasses of H1–T1 within which interactions are enriched. Nodes that occur as singletons are not shown.
(D) Synonymous sets of code words. The code words belonging to each subclass are explicitly listed, in the same order as in the matrix of Figure 4C. Comparison with the matrix shows that, within a given subclass, each head is compatible with several tails, and vice-versa. The subclasses are labeled by shade, as well as by the index a, b, c, d. Within each subclass, we see a high degree of code word sequence similarity. If an amino acid occurs in a majority of instances at a given position, it is included in the consensus sequence characterizing a given subclass.
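The majority rule for the consensus sequence can be sketched as below. Two choices are assumptions: "majority" is read as a strict majority (more than half of the code words), and positions without a majority residue are marked with a placeholder character. The code words in the example are made up and are not taken from the dataset.

```python
from collections import Counter


def consensus(code_words, placeholder="x"):
    """Per-position consensus: keep a residue only if it occurs in more than half of the words."""
    length = len(code_words[0])
    if any(len(word) != length for word in code_words):
        raise ValueError("all code words must have the same length")
    result = []
    for pos in range(length):
        column = [word[pos] for word in code_words]
        residue, count = Counter(column).most_common(1)[0]
        result.append(residue if count > len(column) / 2 else placeholder)
    return "".join(result)


# Hypothetical three-letter code words for one subclass.
toy_words = ["EKR", "EKD", "DRD", "ERD"]
print(consensus(toy_words))  # ExD
```

In the toy example, E and D each occur in three of four words and are kept, while the middle position is split evenly between K and R and is left undetermined.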
(E) Histogram of clustering energies for 50 datasets with randomized interactions (Text S1, Section 3). The more negative the energy, the better the clustering. The red line indicates the energy of the true dataset, far to the left of the distribution for randomized datasets. This indicates that the observed degree of clustering is statistically significant, with p-value < 0.02.
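One reading of the p-value bound, consistent with 50 randomized datasets, is an empirical p-value: if no randomized dataset reaches an energy as low as the true one, the p-value can only be bounded by 1/50 = 0.02. The sketch below assumes this interpretation; the exact calculation is the one in Text S1, Section 3, and the energies shown are invented placeholders.

```python
def empirical_p(true_energy, random_energies):
    """Return (fraction of randomized energies <= true energy, resolution 1/n).

    When the fraction is zero, the data support only the bound p < 1/n.
    """
    n = len(random_energies)
    as_extreme = sum(1 for e in random_energies if e <= true_energy)
    return as_extreme / n, 1 / n


# Placeholder energies for 50 randomized datasets (not real data).
toy_random = [-10.0 - 0.1 * k for k in range(50)]
toy_true = -30.0  # more negative means better clustering

fraction, resolution = empirical_p(toy_true, toy_random)
print(fraction, resolution)  # 0.0 0.02
```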
Figure 5.
CRoSS Residues and Physical Interactions
In (B–F), symbols associated with head domains are colored red, and those associated with tail domains are colored green. This coloring is unrelated to compatibility class, as all domains pictured belong to compatibility class H1–T1.
(A) Comparison of the residues selected by CRoSS to those that are in physical contact in the docked domain NMR structure. The matrix has residues of the head domain indexed vertically, and residues of the tail domain indexed horizontally. Each entry is shaded grey if the pairwise distance between the corresponding residues is 5 Å or less, and white otherwise. The residue pairs selected by CRoSS are highlighted as red boxes; remarkably, four of the seven CRoSS pairs are separated by 5 Å or less. Residue pairs previously suggested as contributing to specificity are highlighted as blue circles: R1, suggested by Broadhurst et al. [15] and Weissman [20] as “code residue pairs” that play a critical role in discrimination; R2, demonstrated by Weissman [21] to alter the efficiency of docking in the erythromycin PKS. It has also been suggested that the entire complement of hydrophobic residues at the docking interface might contribute to specificity [20].
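A contact matrix of this kind can be sketched from atomic coordinates. The caption does not say which atoms define the pairwise distance; taking the minimum distance over all atom pairs of the two residues is an assumption made here, and the coordinates are toy values rather than the NMR structure.

```python
import math


def contact_matrix(head_residues, tail_residues, cutoff=5.0):
    """Boolean matrix: entry [i][j] is True when head residue i and tail residue j
    have at least one pair of atoms within `cutoff` angstroms.

    Each residue is a list of (x, y, z) atom coordinates.
    """
    def min_distance(res_a, res_b):
        return min(math.dist(p, q) for p in res_a for q in res_b)

    return [
        [min_distance(h, t) <= cutoff for t in tail_residues]
        for h in head_residues
    ]


# Toy coordinates in angstroms (hypothetical, one atom per residue).
toy_heads = [[(0.0, 0.0, 0.0)], [(10.0, 0.0, 0.0)]]
toy_tails = [[(3.0, 3.0, 0.0)], [(10.0, 0.0, 4.0)], [(20.0, 0.0, 0.0)]]

toy_contacts = contact_matrix(toy_heads, toy_tails)
print(toy_contacts)  # [[True, False, False], [False, True, False]]
```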
(B–C) Structure of the docking complex between the head domain of protein 2 (red) and the tail domain of protein 3 (green) of the erythromycin PKS [15]. PKS proteins exist as homodimers, so the docking complex involves a pair of tail domains TA and TB (which form a coiled coil) and a pair of head domains HA and HB (which are alpha helices that lock around the coiled coil).
(B) Cartoon representation of the structure, showing domain labels.
(C) Space-filling representation of the structure. The head domains are now colored dark grey, and the tail domains are colored light grey, while the head and tail residues selected by CRoSS are colored red and green, respectively. The three circles indicate regions that have been magnified in subsequent panels of the figure.
(D–F) Space-filling representations of the structure, magnified around the neighborhood of the CRoSS residues. The side chains of the significant head (red) and tail (green) residues are also shown. As seen in Figure 5A, four out of the seven CRoSS residue pairs correspond to pairwise physical interactions: {i, j} = {6,5}, {12,12}, {13,11}, {12,16}. Note that the two copies of any residue participate in symmetric contacts, but it is not possible for CRoSS to assign which of two possible pairings, i.e., A–A and B–B versus A–B and B–A, actually occurs. For example, HB6 contacts TA5, while HA6 contacts TB5. One copy of each interacting pair is shown here: (D) HB6-TA5, (E) HA12-TB12, (E) HA13-TA11, (F) HB12-TA16. We have also highlighted (F) TA18, which does not lie on the head–tail interface but is instead buried within the tail coiled-coil. It is possible that destabilization of the TA18-TB18 contact changes the global conformation of the coiled-coil, and thus has a long-range effect on head–tail interactions. (Similar long-range effects have been demonstrated to alter the efficiency of docking in the erythromycin PKS [21].)
Figure 6.
Domain Shuffling and Domain Linkage
(A) ΔH is the fraction of amino acid differences between some pair of head domains; ΔT is the fraction of amino acid differences between some pair of tail domains. We compare the observed degree of similarity of two head domains (ΔH) with that of the two tail domains on the same proteins (ΔT) or of the two tail domains that are their interaction partners (ΔT′). Each such comparison gives us a point in ΔH − ΔT space; by running over all possible pairs, we generate a family of points.
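The construction of points in ΔH − ΔT space can be sketched as follows. The sketch assumes that the domains have already been aligned to a common length and that every aligned position, including any gap character, counts toward the fraction; neither detail is stated in the caption. The sequences are toy strings, not real docking domains.

```python
from itertools import combinations


def fraction_different(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def delta_points(proteins):
    """One (delta_H, delta_T) point for every unordered pair of proteins.

    proteins: list of (head_sequence, tail_sequence) tuples, where the tail is
    either on the same protein (delta_T) or the head's partner (delta_T prime).
    """
    return [
        (fraction_different(h1, h2), fraction_different(t1, t2))
        for (h1, t1), (h2, t2) in combinations(proteins, 2)
    ]


# Toy aligned sequences (hypothetical).
toy_proteins = [("AAAA", "CCCC"), ("AAAT", "CCGG"), ("TTTT", "CCCC")]
toy_points = delta_points(toy_proteins)
print(toy_points)  # [(0.25, 0.5), (1.0, 0.0), (0.75, 0.5)]
```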
(B,C) Density plots of points in ΔH − ΔT space, with axes running from 0 (identical) to 1 (distinct). To generate these plots, a 2-D histogram was calculated by binning the data into a 10-by-10 grid, which was then smoothed by interpolation. The multimodal appearance of these plots is a manifestation of the underlying phylogenetic clusters.
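The binning step can be sketched in plain Python. Only the 10-by-10 grid over the range 0 to 1 comes from the caption; the smoothing by interpolation is not reproduced, the handling of the upper edge is an assumption, and the points are toy values.

```python
def histogram_2d(points, bins=10):
    """Count (delta_H, delta_T) points in a bins-by-bins grid over [0, 1] x [0, 1].

    grid[row][col]: row indexes delta_H, col indexes delta_T. A value of exactly
    1.0 is placed in the last bin (an assumption about edge handling).
    """
    grid = [[0] * bins for _ in range(bins)]
    for delta_h, delta_t in points:
        row = min(int(delta_h * bins), bins - 1)
        col = min(int(delta_t * bins), bins - 1)
        grid[row][col] += 1
    return grid


# Toy points (hypothetical).
toy_hist_points = [(0.05, 0.05), (0.25, 0.55), (0.25, 0.58), (1.0, 1.0)]
toy_grid = histogram_2d(toy_hist_points)
print(toy_grid[0][0], toy_grid[2][5], toy_grid[9][9])  # 1 2 1
```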
(B) Density plot of ΔH versus ΔT, when the head and tail domains belong to the same protein. A priori, we expect two proteins descended from a common ancestor to show a uniform degree of sequence similarity across their entire length. Instead, we find that ΔH is largely uncorrelated with ΔT (correlation coefficient cc = 0.13). That is, proteins with very similar head domains can have highly diverged tail domains, and vice versa. This implies that proteins are not inherited in their entirety, but instead undergo frequent domain shuffling.
(C) Density plot of ΔH versus ΔT′, when the head and tail domains are interaction partners. In this case, ΔH and ΔT′ are highly correlated (cc = 0.67), implying that these two domains are evolutionarily linked. Remarkably, the interacting domain pair straddling two proteins, rather than the protein itself, constitutes the true unit of inheritance.
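The correlation coefficients quoted in panels B and C can be computed from the point sets as sketched below. The captions do not say which coefficient "cc" denotes; the Pearson coefficient is assumed here. The inputs are toy values chosen to give perfect correlation and anticorrelation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values (hypothetical): identical trends, then opposite trends.
delta_h = [0.0, 0.5, 1.0]
cc_same = pearson(delta_h, [0.0, 0.5, 1.0])
cc_opposite = pearson(delta_h, [1.0, 0.5, 0.0])
print(round(cc_same, 6), round(cc_opposite, 6))  # 1.0 -1.0
```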