Structural and Functional Diversity of the Microbial Kinome

The eukaryotic protein kinase (ePK) domain mediates the majority of signaling and coordination of complex events in eukaryotes. By contrast, most bacterial signaling is thought to occur through structurally unrelated histidine kinases, though some ePK-like kinases (ELKs) and small molecule kinases are known in bacteria. Our analysis of the Global Ocean Sampling (GOS) dataset reveals that ELKs are as prevalent as histidine kinases and may play an equally important role in prokaryotic behavior. By combining GOS and public databases, we show that the ePK is just one subset of a diverse superfamily of enzymes built on a common protein kinase–like (PKL) fold. We explored this huge phylogenetic and functional space to cast light on the ancient evolution of this superfamily, its mechanistic core, and the structural basis for its observed diversity. We cataloged 27,677 ePKs and 18,699 ELKs, and classified them into 20 highly distinct families whose known members suggest regulatory functions. GOS data more than tripled the count of ELK sequences and enabled the discovery of novel families and classification and analysis of all ELKs. Comparison between and within families revealed ten key residues that are highly conserved across families. However, all but one of the ten residues has been eliminated in one family or another, indicating great functional plasticity. We show that loss of a catalytic lysine in two families is compensated by distinct mechanisms both involving other key motifs. This diverse superfamily serves as a model for further structural and functional analysis of enzyme evolution.


Introduction
The eukaryotic protein kinase (ePK) domain is the most abundant catalytic domain in eukaryotic genomes and mediates the control of most cellular processes, by phosphorylation of a significant fraction of cellular proteins [1][2][3]. Most prokaryotic protein phosphorylation and signaling is thought to occur through structurally distinct histidine-aspartate kinases [4]. However, there is growing evidence for the existence and importance of different families of ePK-like kinases (ELKs) in prokaryotes [5][6][7][8][9][10]. ePKs and ELKs share the protein kinase-like (PKL) fold [11] and similar catalytic mechanisms, but ELKs generally display very low sequence identity (7%-17%) to ePKs and to each other. Crystal structures of ELKs such as aminoglycoside, choline, and Rio kinases reveal striking similarity to ePKs [12][13][14], and other ELKs have been defined by remote homology methods [6,15] and motif conservation [16]. Another set of even more divergent PKL kinases are undetectable by sequence methods, but retain structural and mechanistic conservation with ePKs. These include the phosphatidyl inositol kinases (PI3K) and related protein kinases, alpha kinases, the slime mold actin fragmin kinases, and the phosphatidyl inositol 5' kinases [17][18][19][20].
These studies demonstrate that PKL kinases conserve both fold and catalytic mechanisms in the presence of tremendous sequence variation, which allows for an equivalent diversity in substrate binding and function. This makes the PKL fold a model system to investigate how sequence variation maps to functional specialization. Previous studies along these lines include the study of ePK-specific regulatory mechanisms, through ePK-ELK comparison [16], and the sequence determinants of functional specificity within one group (CMGC [CDK, MAPK, GSK3, and CLK kinases]) of ePKs [21].
Previous studies have been hampered by poor annotation and classification of ELK families and their low representation in sequence databases relative to ePKs. Recent large-scale microbial genomic sequencing, coupled with Global Ocean Sampling (GOS) metagenomic data, now allows a much more comprehensive analysis of these families. In particular, the GOS data provide more than 6 million new peptide sequences, mostly from marine bacteria [22,23], and more than triple the number of ELK sequences. Here, for the first time, we define the extent of 20 known and novel PKL families, define a set of ten key conserved residues within the catalytic domain, and explore specific elaborations that mediate the unique functions of distinct families. These highlight both underappreciated aspects of the catalytic core and unique family-specific features, which in several cases reveal correlated changes that map to concerted variations in structure and mechanism.

Discovery and Classification of PKL Kinase Families
Kinase sequences were detected using hidden Markov model (HMM) profiles of known PKLs as well as with a motif model focused on key conserved PKL motifs [16,24]. Results of each approach were used to iteratively build, search, and refine new sets of HMMs, using both public and GOS data. Weak but significant sequence matches were used as seeds to define and elaborate novel families. The final result was 16,248 GOS sequences (Dataset S1) classified into 20 HMM-defined PKL families (Table 1; Figure 1; Dataset S2). A similar analysis of the National Center for Biotechnology Information nonredundant public database (NCBI-nr) revealed 24,924 ePK and 5,151 ELK sequences (Dataset S1). More than 1,400 of the NCBI ELK sequences were annotated as hypothetical or unknown, and several hundred more are misannotated or have no functional annotation. GOS data at least double the size of most families, and permit an in-depth analysis of family structure and conservation. Two families that are more than 10-fold enriched in GOS (CapK and HSK2) are found largely in α-proteobacteria, which are also highly enriched in GOS. Both CapK and HRK contain viral-specific subfamilies that are also greatly GOS enriched, indicating that differences in kinase distribution between databases are largely due to taxonomic biases. As expected, eukaryotic-specific families (ePK, Bub1, PI3K, AlphaK) are underrepresented in GOS.
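The enrichment statements above can be checked with simple arithmetic. The following is a minimal sketch, assuming that each count pair in Table 1 is ordered GOS/NCBI-nr (an inference from the enrichment statements, not something the table fragment states) and that the 18,699 ELKs quoted in the abstract are the combined GOS and NCBI-nr total. Only families whose counts survive in Table 1 are included, so CapK is absent; the function names are illustrative.

```python
# Count pairs copied from the surviving rows of Table 1.
# ASSUMPTION: each pair is (GOS count, NCBI-nr count).
TABLE1_COUNTS = {
    "Bub1": (9, 112),
    "Bud32": (139, 123),
    "Rio": (133, 249),
    "KdoK": (389, 199),
    "CAK": (3997, 1427),
    "HSK2": (1649, 93),
    "FruK": (390, 136),
    "MTRK": (144, 38),
    "UbiB": (4110, 623),
    "MalK": (29, 80),
}


def gos_enrichment(counts):
    """Return {family: GOS count divided by NCBI-nr count}."""
    return {family: gos / ncbi for family, (gos, ncbi) in counts.items()}


def enriched_families(counts, threshold=10.0):
    """Return the families whose GOS/NCBI-nr ratio exceeds the threshold."""
    ratios = gos_enrichment(counts)
    return sorted(family for family, ratio in ratios.items() if ratio > threshold)


# ASSUMPTION: 18,699 is the combined (GOS + NCBI-nr) ELK total from the abstract,
# and 5,151 is the NCBI-nr ELK count quoted in the text.
TOTAL_ELK = 18_699
NCBI_ELK = 5_151
elk_fold_increase = TOTAL_ELK / NCBI_ELK

if __name__ == "__main__":
    for family, ratio in sorted(gos_enrichment(TABLE1_COUNTS).items()):
        print(f"{family:6s} GOS/NCBI-nr = {ratio:.2f}")
    print("More than 10-fold enriched:", enriched_families(TABLE1_COUNTS))
    print(f"ELK fold increase: {elk_fold_increase:.2f}")
```

Under these assumptions, HSK2 is the only listed family above the 10-fold threshold, the eukaryotic-specific Bub1 has a ratio below 1, and the ELK total is a little over three times the NCBI-nr count.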

Author Summary
The huge growth in sequence databases allows the characterization of every protein sequence by comparison with its relatives. Sequence comparisons can reveal both the key conserved functional motifs that define protein families and the variations specific to individual subfamilies, thus decorating any protein sequence with its evolutionary context. Inspired by the massive sequence trove from the Global Ocean Survey project, the authors looked in depth at the protein kinase-like (PKL) superfamily. Eukaryotic protein kinases (ePKs) are the pre-eminent controllers of eukaryotic cell biology and among the best studied of enzymes. By contrast, their prokaryotic relatives are much more poorly known. The authors hoped to both characterize and better understand these prokaryotic enzymes, and also, by contrast, provide insight into the core mechanisms of the eukaryotic protein kinases. The authors used remote homology methods, and bootstrapped on their discoveries to detect more than 45,000 PKL sequences. These clustered into 20 major families, of which the ePKs were just one. Ten residues are conserved between these families: six were known to be important in catalysis, but four more, including three highly conserved in ePKs, are still poorly understood, despite their ancient conservation. Extensive family-specific features were found, including the surprising loss of all but one of the ten key residues in one family or another. The authors explored some of these losses and found several cases in which changes in one key motif substitute for changes in another, demonstrating the plasticity of these sequences. Similar approaches can be used to better understand any other family of protein sequences.

Functional Diversity of PKL Families
These 20 PKL families display great functional and sequence diversity, though common sequence motifs and functional themes recur. Some families are entirely uncharacterized, and few have been well studied, though most have some characterized members, many with known kinase activity. Their substrates include proteins and small molecules such as lipids, sugars, and amino acids, and they generally appear to have regulatory functions (Table 1). This is in contrast to the diversity of several other structurally unrelated, small-molecule kinase families that play largely metabolic roles [25]. Profile-profile alignments show clear but distant relationships between several families, which are enclosed by ovals in Figure 1. The ePK cluster includes pknB, which is highly similar to but distinct from ePK and is distinguished by its exclusive bacterial specificity, as opposed to the mostly eukaryotic ePK family. The other major cluster is centered on the large and divergent CAK (choline and aminoglycoside kinase) family, and includes three other families of small-molecule kinases.
CAK itself is particularly diverse, containing subfamilies that are specific for choline/ethanolamine and aminoglycosides, as well as many novel subfamilies, some of which are specific to eukaryotic sublineages. A looser cluster is formed between the Rio and Bud32 families, which are universal among both eukaryotes and archaea, and the bacterial lipopolysaccharide kinase family KdoK. An additional four families (UbiB, revK, MalK, CapK) are distantly related to all three clusters, and are distinct from another set (PI3K, AlphaK, and IDHK) whose members have even less similarity to any other kinase; for PI3K and AlphaK, the relationship to kinases was determined by structural comparisons [11], while IDHK displays only conservation of the key residues and motifs found in all PKL kinases.
Sequence similarity between these 20 families varies from very low (~20%) to almost undetectable. Sequence-profile methods are generally required to align families within the oval clusters of Figure 1, while alignments between clusters require profile-profile methods. The diversity of this collection is demonstrated by comparison with the automated sequence- and profile-based clustering of the overall GOS analysis [22], which assigns 93% of these sequences into 32 clusters, each of which is largely specific to one of our 20 families.

Table 1 (partial). PKL families: family name, sequence counts (GOS/NCBI-nr), and description
... [55] and two distinct sets of largely viral kinases, one of which lacks the GxGxxG and VAIK motifs, suggesting that they may be catalytically inactive and/or interfere with host kinase signaling.
Bub1 (9/112): Pan-eukaryotic protein kinase, functions in mitotic spindle assembly.
Bud32 (139/123): Universal/single copy gene in eukaryotes and archaea. In vertebrates it phosphorylates p53, while in S. cerevisiae it is involved in bud site selection, both unconserved processes [56]. Recently implicated in telomere regulation [57].
Rio (133/249): Universal eukaryotic/archaeal protein kinase, implicated in control of translation, a function that is highly conserved between eukaryotes and archaea [58]. Also found in some bacteria, particularly proteobacteria.
KdoK (389/199): Small family of bacterial kinases known to phosphorylate sugar moieties of LPS [59]. Reportedly autophosphorylates on tyrosine [60]. High sequence variation, even at key motifs, suggests diverse functions.
CAK (3997/1427): Choline and aminoglycoside kinases. Includes many novel subfamilies. Bacterial choline kinase (ChoK, licA) modifies LPS, enabling mucosal binding for several human-commensal and pathogenic bacteria [61]. Expression is controlled by phase variation. Metazoan choline kinases are involved in the production of phosphatidyl choline and acetylcholine, and metazoans also have related ethanolamine kinases. Aminoglycoside kinases (aminoglycoside phosphotransferases, APH) phosphorylate and inactivate aminoglycosides, antibiotics that target the bacterial ribosome [62]. They are produced as antidotes by aminoglycoside-producing bacteria, and by many of their targets.
HSK2 (1649/93): One of several structurally distinct homoserine kinases, involved in threonine biosynthesis. Found mostly in α-proteobacteria, mirroring the distribution of HSK1 in γ-proteobacteria. Unlike HSK1, it does not chromosomally cluster with other threonine biosynthesis genes, but is usually linked to the lytB gene involved in isoprenoid biosynthesis, to RNAse H1, and to clusters of novel genes (NK, GM, unpublished data). These suggest additional functions for this family.
FruK (390/136): Fructosamine kinase. Initiates repair of aging proteins by phosphorylating residues damaged by glycosylation, leading to their repair [63]. Found in most eukaryotes and many bacteria, and may also have sugar kinase activities [64].
MTRK (144/38): Methylthioribose kinase. Involved in a sulphur salvage pathway of methionine synthesis. Expression is controlled by methionine levels in K. pneumoniae [65] and by starvation in B. subtilis [66]. Present in select bacteria and plants, but not in higher eukaryotes.
UbiB (4110/623): UbiB (ABC1 in eukaryotes). Regulates the ubiquinone (co-enzyme Q) biosynthesis pathway in both prokaryotes and yeast [67,68]. It is speculated to activate an unknown mono-oxygenase in the ubiquinone biosynthesis pathway, possibly in response to aerobic induction. Ubiquitous in eukaryotes and widespread in bacteria.
MalK (29/80): Maltose kinase. Contains two members shown biochemically to be maltose kinases [69]. Most public members are annotated as trehalose synthases, based on transitive annotation from a member that is fused to trehalose synthase.

Key Conserved Residues Unify Diverse Kinase Families
Comparison between all families reveals a set of ten key residues that not only account for one-third of the residues conserved within each family, but also are consistently conserved between families, constituting a core pattern of conservation that helps define this superfamily (Table 2, Figure 2, Figure 3). These residues are conserved across the major divisions of life, which diverged one to two billion years ago, and across diverse families, which presumably diverged even earlier. Thus, they are likely to mediate core functions of the catalytic domain rather than merely maintaining its structure.
Six of these residues are known to be involved in ATP and substrate binding and catalysis (G52, K72, E91, D166, N171, D184; residues numbered based on PKA structure 1ATP except where otherwise noted; see Table 3). The full functions of the other four remain unclear, though three of them (H158, H164, and D220) are part of a hydrogen-bonding network that links the catalytically important DFG motif with substrate-binding regions (Figure 2). The conservation of this network across diverse PKL structures suggested a role for this network in coupling DFG motif-associated conformational changes with substrate binding and release [16]. Despite this ancient conservation, different families of ePKs have lost individual members of this triad without destroying structure or catalytic function: H164 is changed to a tyrosine in PKA and many other AGC families; H158 is lost in most tyrosine kinases; and D220 is lost in the Pim family. The Pim1 structure retains an ePK-like structure, perhaps in part due to stabilization of the catalytic loop by the activation loop, a function normally performed by D220 [26], suggesting a novel mode of coupling ATP and substrate binding in this family. The individual loss of each member of this triad suggests that they have independent functions yet to be understood.
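The distinction between residues conserved within a family and residues conserved between families can be illustrated with a toy calculation. The sketch below is not the authors' pipeline: it assumes pre-aligned sequences of equal length, uses a simple most-common-residue fraction as the conservation measure, and runs on invented five-column alignments. The function names and the toy data are hypothetical.

```python
from collections import Counter


def conserved_columns(alignment, threshold=0.9):
    """Map column index -> consensus residue for columns at or above threshold."""
    if not alignment:
        return {}
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = {}
    for i in range(length):
        residue, count = Counter(seq[i] for seq in alignment).most_common(1)[0]
        # Gap-dominated columns are never reported as conserved.
        if residue != "-" and count / len(alignment) >= threshold:
            conserved[i] = residue
    return conserved


def shared_core(families, threshold=0.9):
    """Columns conserved in every family with the same residue in each."""
    per_family = [conserved_columns(a, threshold) for a in families.values()]
    if not per_family:
        return {}
    first = per_family[0]
    return {col: res for col, res in sorted(first.items())
            if all(other.get(col) == res for other in per_family)}


def family_selective(families, threshold=0.9):
    """Columns conserved within every family but differing between families."""
    per_family = [conserved_columns(a, threshold) for a in families.values()]
    if not per_family:
        return []
    common = set(per_family[0]).intersection(*per_family[1:])
    return sorted(col for col in common
                  if len({fam[col] for fam in per_family}) > 1)


# Invented toy alignments (hypothetical data, not real kinase sequences).
TOY_FAMILIES = {
    "famA": ["GKLED", "GKLED", "GKLQD"],
    "famB": ["GRIED", "GRIED", "GRIAD"],
}

if __name__ == "__main__":
    print("shared core:", shared_core(TOY_FAMILIES))
    print("family-selective columns:", family_selective(TOY_FAMILIES))
```

In the toy data, columns 0 and 4 play the role of key residues shared by both families, while columns 1 and 2 are conserved within each family but differ between them, which is the pattern the text describes for family-selective positions.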

Sequence and Structural Diversity
Family-specific functions are mediated by features that are highly conserved within families, but that are divergent between families (Figure 4). Many family-selective residues map to the motifs surrounding the ten key residues, or to the divergent C-terminal substrate-binding region (Tables 2 and S1). The proximity of these residues to the active site suggests that they are key in selecting substrates or tuning mechanism of action. For instance, the 4-amino acid (aa) stretch between the HxD166 and N171 residues is highly conserved but distinct between families (Figure 4), and provides a discriminative signature that defines each family. Within ePKs, tyrosine and serine/threonine-specific kinases display distinct patterns of conservation within this 4-aa stretch [27]. Serine/threonine kinases conserve a [LI]KPx motif within this stretch, while tyrosine kinases conserve a [LI]AAR motif. These variations alter the surface electrostatics of the substrate-binding pocket, thereby contributing to substrate specificity [27].
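The two motifs described above can be expressed as regular expressions anchored on the flanking D166 and N171 residues. This is a minimal sketch, not a validated classifier: the patterns encode only the [LI]KPx and [LI]AAR signatures named in the text, and the four test fragments are short illustrative catalytic-loop strings chosen to resemble a PKA-like, a CDK-like, an insulin receptor-like, and a Src-like loop; they should be treated as assumptions rather than authoritative sequences.

```python
import re

# D166, the 4-aa stretch, then N171 (PKA numbering).
SER_THR_PATTERN = re.compile(r"D([LI]KP.)N")  # [LI]KPx signature
TYR_PATTERN = re.compile(r"D([LI]AAR)N")      # [LI]AAR signature


def classify_catalytic_loop(fragment):
    """Label a catalytic-loop fragment by the 4-aa signature it contains."""
    if TYR_PATTERN.search(fragment):
        return "tyrosine"
    if SER_THR_PATTERN.search(fragment):
        return "serine/threonine"
    return "unclassified"


# Illustrative fragments (assumed, for demonstration only).
EXAMPLES = {
    "PKA-like": "YRDLKPENLL",
    "CDK-like": "HRDLKPQNLL",
    "insulin receptor-like": "HRDLAARNCM",
    "Src-like": "HRDLRAANIL",
}

if __name__ == "__main__":
    for name, fragment in EXAMPLES.items():
        print(f"{name:22s} {fragment}  ->  {classify_catalytic_loop(fragment)}")
```

The Src-like fragment carries a variant of the tyrosine kinase signature and is left unclassified by these two patterns, a reminder that a real classifier would need family-specific profiles rather than two literal motifs.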
The C-terminal region of ~100 aa following the DFG motif is highly divergent between families, apart from the conserved D220 at the beginning of the F-helix (Figure 2; Dataset S3). Secondary structure is generally predicted to be helical, but the poor sequence conservation and known structures [11] suggest that the overall orientation of the helices may be different between families. Notably, in the crystal structures of APH bound to its substrate, kanamycin [28], the relative positioning of the substrate-binding helices (αH-αI) is distinct from that of ePKs (Figure 2). The presence of unique patterns of conservation in each family (Table 2) also suggests that this region is involved in family-specific functions.
Several families contain sizeable (~30-100 aa) insert segments between core subdomains that are specific to clusters of families. Most CAK members have an insert segment between subdomains VIa and VIb. There is very little sequence similarity within this segment across CAK members, but structures of APH and ChoK indicate some structural similarity and highlight its role in substrate binding [28,29]. An equivalent insert is seen in the other CAK cluster families, FruK, HSK2, and MTRK. Similarly, KdoK and Rio contain an insert between subdomains II and III, which shows some sequence similarity between these families. In the Rio2 structure, this insert is disordered, but the presence of a conserved threonine suggests a possible regulatory role [14]. This region also contains an insert in the distinct UbiB family. Finally, the ePK, pknB, and HRK families contain an extended activation loop between subdomains VIII and IX. These kinases are generally activated by phosphorylation of this loop, the negative charge of which helps to coordinate key structural elements during the activation process, including a family-selective HRD arginine in the catalytic loop [30,31].

Mechanistic Diversity of the Catalytic Core
A surprising finding was that while ten key residues are conserved both within and between families, all but one of them were dispensable in one family or another (Figure 3), indicating that even catalytic residues are malleable in the appropriate context. Here we explore the effect of loss of the "catalytic lysine" K72, which typically positions the α and β phosphates of ATP (Figure 5A). Mutation of this lysine in ePKs is a common method to make inactive kinases [32]. Yet this residue is conserved as an arginine (R111 in ChoK) in most CAK subfamilies, as a methionine in the CAK-chloro subfamily, and as a threonine in the related HSK2 family (Figure 4).
In the two major CAK subfamilies with a conserved R72 (FadE and choline kinase [ChoK]), we see correlated changes in the glycine-rich and DFG motifs (Figure 4). Specifically, the Phe and Gly within the GxGxFG motif (F54 and G55) are changed to Ser/Thr and Asn, respectively (S86 and N87 in ChoK), and G186 within the DFG motif is changed to E. Both the GxGxFG and DFG motifs are spatially proximal to K72 (Figure 5A). Thus, correlated changes in these two motifs could structurally account for the K-to-R change. Indeed, in the ChoK crystal structure [13], N55 protrudes into the ATP-binding pocket and hydrogen bonds to R72. In addition, the conserved E91 in helix C, which typically forms a salt bridge with K72, is hydrogen bonded (via a water molecule) to the covarying E186, thus linking these three correlated changes and stabilizing R72 in a unique conformation (Figure 5B). By contrast, the two solved APH structures (1ND4 and 2BKK) retain the "ancestral" sequence state with K72 and G186, and lack N55.
Mutation of R72 or E186 to alanine in ChoK reduces the catalytic rate by several fold [33]. To test the possible role of these residues in the ChoK catalytic mechanism, we modeled an ATP in the active site of ChoK (based on the nucleotide-bound structures of APH and PKA). This revealed that R72 partially occludes the ATP-binding site and is likely to move upon ATP binding. Notably, a K72-to-R mutation in Erk2 [34] also exhibits a conformational change in R72 upon nucleotide binding (Figure 5A). In this conformation, R72 could potentially hydrogen bond to both E91 and the covarying E186 in ChoKs, which might explain the covariation of R72 and E186 in these families.

Variation on a Theme
Other CAK members display distinct coordinated changes at the G55, K72, and G186 positions. The chloro subfamily of CAK loses the positive charge at position 72 altogether, replacing it with methionine, and has concurrent changes to R55 and Q186 (Figure 4). This may reflect a shift of the positive charge from position 72 to 55, an event that also happened in Wnk kinases, the only functional ePK family that lacks K72. The conserved K55 of Wnks is required for catalysis and has been shown to interact with ATP similarly to K72 of PKA [35] (Figure 5D). Hence, two evolutionary inventions may have converted the same core motif residue from one function to another. In CAK-chloro, the unpaired E91 position loses its charge to become a conserved Phe. The function of this Phe is unknown, but is likely to be important since it is also conserved in HSK2, a related family, and the only other kinase family to conserve a Phe at the E91 position (Figure 4).
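The coordinated changes described in this and the preceding section can be summarized as a small lookup table. The sketch below records only residue states that the text states explicitly (PKA numbering); positions the text does not specify are left as None rather than guessed. The dictionary layout and function name are illustrative assumptions.

```python
POSITIVE_RESIDUES = {"K", "R"}

# Residue states at core positions, as described in the text (PKA numbering).
# None marks a position whose residue the text does not state.
CORE_STATES = {
    "ePK (PKA)":     {55: "G", 72: "K", 91: "E", 186: "G"},
    "CAK ChoK/FadE": {55: "N", 72: "R", 91: "E", 186: "E"},
    "CAK-chloro":    {55: "R", 72: "M", 91: "F", 186: "Q"},
    "Wnk":           {55: "K", 72: None, 91: None, 186: None},
    "HSK2":          {55: None, 72: "T", 91: "F", 186: None},
}


def positive_charge_positions(state):
    """Return the core positions that carry a lysine or arginine."""
    return [pos for pos in sorted(state) if state[pos] in POSITIVE_RESIDUES]


if __name__ == "__main__":
    for family, state in CORE_STATES.items():
        positions = positive_charge_positions(state)
        print(f"{family:14s} positive charge at: {positions or 'none listed'}")
```

Read this way, ePK and ChoK/FadE keep the positive charge at position 72, whereas CAK-chloro and Wnk carry it at position 55, matching the proposed shift of the charge between the two motifs. For HSK2 the text gives no positively charged residue at these positions.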

Evolution of Conformational Flexibility and Regulation in ePKs
The ePK catalytic domain is highly flexible and undergoes extensive conformational changes upon ATP binding [36]. In contrast, crystal structures of APH, solved in both ATP-bound and -unbound forms, revealed modest structural changes in the ATP-binding pocket [37]. This difference in conformational flexibility is reflected in the patterns of conservation at key positions within the ATP-binding glycine-rich loop (Figure 4). Specifically, two conserved glycines (G50 and G55), which contribute to the conformational flexibility of this loop in ePKs, are replaced by non-glycines in APH. These two glycines are absent in several PKL families (Figure 4) while G52, which is involved in catalysis, is present in most, suggesting that the conformational flexibility of the nucleotide-binding loop is a feature of selected PKL families such as ePKs. Since conformational flexibility allows for regulation, it is likely that modest structural changes associated with nucleotide binding gradually evolved into quite dramatic structural rearrangements required to ensure that key players in various signaling pathways act only at the right place and at the right time. The conserved glycine (G186) within the catalytically important DFG motif may likewise have evolved for regulatory functions in ePKs [38]. This glycine is highly conserved in the ePK cluster but is absent from most other PKL families. However, within the small subfamily of magnesium-dependent Mnk ePK kinases, G186 is changed to aspartate (DFD). In the Mnk2 crystal structure, this DFD motif adopts an "out" conformation in which F185 protrudes into the ATP-binding site. This is in contrast to the "in" conformation, where it packs up below the C-helix [39]. Mutation of the Mnk2 D186 "back" to glycine results in both in and out conformations of the DFG motif, supporting the role of G186 in DFG-associated conformational changes. Such conformational transitions may facilitate regulation of activity since the conformation of the catalytic aspartate is also changed during this transition [38]. This may also explain why the ePK-specific extended activation loop, which is phosphorylated and undergoes dramatic conformational changes, is directly attached to the DFG motif (Figure 6A).
In addition to the flexible catalytic core, the substrate-binding regions appear to have evolved for tight regulation of ePK activity. In particular, the conserved G helix, which was recently shown to undergo a conformational change upon substrate binding [40], is uniquely oriented in ePK/pknB (Figure 6A). Several ePK-conserved residues and motifs are at the interface between the G helix and the catalytic core (Figure 6B). These include the APE motif, located at the C-terminal end of the activation loop, a W-[SA]-X-[G] motif in the F-helix, and an arginine (R280) at the beginning of the I helix (Figure 6B). These three motifs structurally interact with each other and form a network that couples the substrate- and ATP-binding regions (Figure 6B). This network also involves conserved buried water molecules, which are known to contribute to the conformational flexibility of proteins [41]. Thus, this ePK/pknB-conserved network may also facilitate regulation by increasing the conformational flexibility of the substrate-binding regions [16].

Table 2 (legend). Across the ~250-aa domain, almost one-third of >90% conserved residues map to the ten key residues that are also conserved between families, and more than half map to these key residues or their surrounding motifs (GxGxxGxxxx, vaiK, E, vP, LxxLH, xxHxDxxxNxx, xxDxGxx, DLA; boldfacing indicates the ten key residues). An additional 15% are found in the largely unalignable region C-terminal of DxG, strongly suggesting family-specific functions, and 10% are semi-conserved, being found in some, but not most families. See Dataset S3 for details.
doi:10.1371/journal.pbio.0050017.t002

Table 3 (partial). Key conserved residues: residue, location, and function
K72 (β3 strand): Hydrogen bonds to the α-oxygen and β phosphate of ATP [72].
E91 (C-helix): Forms a salt bridge interaction with K72 and functions as a regulatory switch in ePKs [73].
P104 (αC-β4 loop): Unknown function. Absent from ePKs.
H158 (C-terminus of E helix): Hydrogen bonds to D220 in the F-helix and is part of a hydrogen bond network that couples ATP and substrate-binding regions [16].
H164 (catalytic loop, αE-β6): Hydrogen bonds to the backbone of the residue before the DFG-Asp and integrates substrate and ATP-binding regions [16].
D166 (catalytic loop): Catalytic base [72].
N171 (catalytic loop): Coordinates the second Mg2+ ion and is involved in phosphoryl transfer.
D184 (N-terminus of activation loop): Coordinates the first Mg2+ ion.
D220 (N-terminus of F helix): Hydrogen bonds to the backbone of the catalytic loop and positions this loop relative to substrate-binding regions [16].
doi:10.1371/journal.pbio.0050017.t003

Discussion
Data from the GOS voyage provides a huge increase in available sequences for most prokaryotic gene families, enabling new studies in discovery, classification, and evolutionary and structural analysis of a wide array of gene families. Even for a eukaryotic family such as ePK kinases, GOS provides insights by greatly increasing understanding of related PKL families. GOS increases the number of known ELK sequences more than 3-fold, and has enabled both the discovery of novel families of kinases as well as a detailed analysis of conservation patterns and subfamilies within known families. We believe that the GOS data, coupled with the recent strong growth in whole-genome sequencing, provide the opportunity for similar insights into virtually every gene family with prokaryotic relatives.
PKL kinases are largely involved in regulatory functions, as opposed to the metabolic activities of other kinases with different folds [25]. The characteristics of this fold that lead to the explosion of diverse regulatory functions of eukaryotic ePKs have also been exploited for many different functions within prokaryotes. While these kinases reflect only ~0.25% of genes in both GOS and microbial genomes (ePKs represent ~2% of eukaryotic genes [42]), indicating a simpler prokaryotic lifestyle, they now outnumber the count of ~12,000 histidine kinases that we observe in GOS [22], suggesting that ELKs may be at least as important in bacterial cellular regulation as the "canonical" histidine kinases.
PKL kinases cross huge phylogenetic and functional spaces while still retaining a common fold and biochemical function of ATP-dependent phosphorylation. The presence of Rio and Bud32 genes in all eukaryotic and archaeal genomes suggests that at least this cluster dates back to the common ancestor of these domains of life. Similarly, the presence of UbiB in all eukaryotes and most bacterial groups, the close similarity of pknB/ePK families, and the widespread bacterial/eukaryotic distribution of FruK suggest their origins before the emergence of eukaryotes, or from an early horizontal transfer. Their ancient divergence leaves little or no trace of their shared structure within their protein sequence other than at functional motifs, which include a set of ten key residues that are highly conserved across all PKLs.
Despite the huge attention paid to ePKs, four key residues (P104, H158, H164, D220), three of which are highly conserved in ePKs, are still functionally obscure and worthy of greater attention, both in ELKs and ePKs. Conversely, it appears that nine of the ten key residues have been eliminated or transformed in individual families while maintaining fold and function, showing that almost anything is malleable in evolution given the right context. That right context is frequently a set of additional changes in the family-specific motifs surrounding these key residues, and we see that in the case of K72, a substitution to arginine triggers a cascade of other core substitutions that serve to retain basic function, while a substitution to methionine involves a shift of the positive charge normally provided by K72 to another conserved residue, in both CAK-chloro and Wnk kinases. Other core changes are also seen independently in very distinct families, such as the G55-to-A change in UbiB and the chloro subfamily of CAK, or the E91-to-F change in both chloro and HSK2, suggesting that these kinases are sampling a limited space of functional replacements.
These families vary greatly in diversity. While the ePK family has expanded to scores of deeply conserved functions [42], other families, including Bud32, Rio, Bub1, and UbiB, usually have just one or a handful of members per genome, suggesting critical function but an inability to innovate. The largely prokaryotic CAK family is also functionally and structurally diverse, containing several known functions and many distinct subfamilies likely to have novel functions. The diversity of both CAK and KdoK sequences may be related to their involvement in antibiotic resistance and immune evasion, likely to be evolutionarily accelerated processes. Comparison of CAK to the related and more functionally constrained HSK2, FruK, and MTRK families may reveal adaptive changes such as the ePK-specific flexibility changes that may assist in its diversity of functions.
GOS data are rich in highly divergent viral sequences, and accordingly we find a number of new subfamilies of viral kinases, including two of the three subfamilies of HRK and a subfamily of CapK. In both cases we see loss of N-terminal-conserved elements, suggesting that these kinases may have alternative functions or even act as inactive competitors to host kinases.
These patterns of sequence conservation and diversity raise many questions that can only be fully addressed by structural methods. The combination of structural and phylogenetic insights for ChoK enabled insights that were not clear from the structure alone, and enabled us to reject other inferences from the crystal structure that were not conserved within this family, highlighting the value of combining these approaches. The relative ease of crystallization of PKL domains, the emergence of high-throughput structural genomics, and our understanding of the diversity of these families make them attractive targets for structure determination of selected members, and position this family as a model for analysis of deep structural and functional evolution.

Materials and Methods
Discovery and classification of kinase genes. Sequences used consisted of 17,422,766 open reading frames from GOS, 3,049,695 predicted open reading frames from prokaryotic genomes, and 2,317,995 protein sequences from NCBI-nr of February 10, 2005, as described [22]. Profile HMM searches were performed with a Time Logic Decypher system (Active Motif, http://timelogic.com) using in-house profiles for ePK, Haspin, Bub1, Bud32, Rio, ABC1 (UbiB), PI3K, and AlphaK domains, as well as Pfam profiles [43] for ChoK, APH, KdoK, and FruK, and TIGRFAM profiles [44] for HSK2 (thrB_alt), UbiB, and MTRK. A number (69) of additional ePK-annotated models from Superfamily 1.67 [45] were used to capture initial hits but not for further classification. Initial hits were clustered and rerun against all models, and each model was rebuilt and rerun three to seven times using ClustalW [46], MUSCLE [47], and hmmalign (http://hmmer.janelia.org) to align, followed by manual adjustment of alignments using Clustal and Pfaat [48] and model building with hmmbuild. Low-scoring members of each family (e > 1 × 10⁻⁵) were used as seeds to build new putative families, and profile-profile and sequence-profile alignments were used to merge families into a minimal set (Dataset S2). A motif-based Markov chain Monte Carlo multiple alignment model [49] based on the conserved motifs of Figure 3 was run independently and used to verify HMM hits and seed new potential families for BLAST-based clustering, model building, and examination for conserved residues. Final family assignment was by scoring against the set of HMM models, with manual examination of sequences with borderline scores (e > 1 × 10⁻⁵ or difference in e-values between best two models >.01).
Family annotations. Annotations of chromosomal neighbors used SMART [50] and a custom analysis of GOS neighbors ([22]; C. Miller, H. Li, D. Eisenberg, unpublished data). Annotation analysis was based on GenBank annotations and PubMed references. Taxonomic analysis used a mapping of GOS scaffolds to taxonomic groupings [22] and NCBI taxonomy tools.
Family alignments and logos. Residue conservation (Dataset S3) was counted from the final alignment using a custom script that omitted gap counts. These counts were then used to construct family logos using WebLogo (http://weblogo.berkeley.edu; [51]).
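The gap-omitting residue count described above can be illustrated with a minimal sketch. This is not the custom script used in this work, which is not reproduced here. The function name column_profile, the gap characters, and the output layout (the most common residues with their fractions among non-gap residues, plus residue and gap counts per column) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: gaps in the alignment are written as '-' or '.'.
GAP_CHARS = {"-", "."}


def column_profile(aligned_seqs, top_n=4):
    """Summarize each alignment column, leaving gaps out of the residue counts.

    aligned_seqs: list of equal-length aligned sequences (strings).
    Returns one dict per column with the number of residues, the number of
    gaps, and the top_n residues with their fraction among non-gap residues.
    """
    if not aligned_seqs:
        return []
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for col in range(length):
        column = [seq[col].upper() for seq in aligned_seqs]
        residues = [c for c in column if c not in GAP_CHARS]
        n_res = len(residues)
        counts = Counter(residues)
        # Fractions are relative to residues only, so gaps do not dilute them.
        # An all-gap column yields an empty list (no division takes place).
        top = [(aa, n / n_res) for aa, n in counts.most_common(top_n)]
        profile.append(
            {
                "position": col + 1,  # 1-based alignment position
                "n_residues": n_res,
                "n_gaps": len(column) - n_res,
                "top": top,
            }
        )
    return profile


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    toy = ["KAG", "KA-", "RAG", "K-G"]
    for row in column_profile(toy):
        print(row)
```

Because fractions are computed over residues only, a column with many gaps can still appear highly conserved, so the residue and gap counts are reported alongside the fractions.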
Family comparisons. Relatedness between families was estimated using several methods. HMM-HMM alignments and scores were computed using PRC (http://supfam.org/PRC), and sequence-profile alignments using hmmalign were analyzed using custom scripts and by inspection. Both full-length and motif multiple alignments were also created and used for the family comparisons.

Supporting Information
Dataset S1.
Dataset S3. Domain Profiles for 20 PKL Families
These 20 spreadsheets show the conservation profile at each residue of the kinase domain for each family, including annotations and classifications of individual residues. Each worksheet details the alignment of one kinase family to its HMM. Every row corresponds to a position within the alignment, listing the four most common amino acids (aa) at that position along with their fractional popularity. The number of aa's and number of gaps at that position within the alignment are also listed. The "Notes" column annotates conservation status of selected residues and other notes, while the ">90% Conserved" column annotates those corresponding residues as to their class (Core, Motif, Motif-Associated, Semi-Conserved, C-terminal, Unique, or external to the kinase domain). A number of color highlights are used. (1) Positions with few aa's in the alignment (typically inserts within the domain that are not of great interest) are shaded gray: typically dark gray for <20 aa at that position, and light gray for >20 but still low (the range varies depending on the depth of the alignment). Rows highlighted in gray have no highlights in any other columns and are assumed not to be part of the core domain.