Carbohydrate Recognition by an Architecturally Complex α-N-Acetylglucosaminidase from Clostridium perfringens

CpGH89 is a large multimodular enzyme produced by the human and animal pathogen Clostridium perfringens. The catalytic activity of this exo-α-d-N-acetylglucosaminidase is directed towards a rare carbohydrate motif, N-acetyl-β-d-glucosamine-α-1,4-d-galactose, which is displayed on the class III mucins deep within the gastric mucosa. In addition to the family 89 glycoside hydrolase catalytic module this enzyme has six modules that share sequence similarity to the family 32 carbohydrate-binding modules (CBM32s), suggesting the enzyme has considerable capacity to adhere to carbohydrates. Here we suggest that two of the modules, CBM32-1 and CBM32-6, are not functional as carbohydrate-binding modules (CBMs) and demonstrate that three of the CBMs, CBM32-3, CBM32-4, and CBM32-5, are indeed capable of binding carbohydrates. CBM32-3 and CBM32-4 have a novel binding specificity for N-acetyl-β-d-glucosamine-α-1,4-d-galactose, which thus complements the specificity of the catalytic module. The X-ray crystal structure of CBM32-4 in complex with this disaccharide reveals a mode of recognition that is based primarily on accommodation of the unique bent shape of this sugar. In contrast, as revealed by a series of X-ray crystal structures and quantitative binding studies, CBM32-5 displays the structural and functional features of galactose binding that is commonly associated with CBM family 32. The functional CBM32s that CpGH89 contains suggest the possibility for multivalent binding events and the partitioning of this enzyme to highly specific regions within the gastrointestinal tract.


Introduction
Mucins are heavily O-glycosylated glycoproteins that protect the epithelia from harmful bacteria by forming a biophysical barrier to infection and by supporting innate and adaptive immunity [1]. A heavily hydrated and highly viscous protective mucosal layer lines the surface of the major entry points to our body, including the eyes, the nasopharynx, the genito-urinary tract and the gastrointestinal tract. Within the gastrointestinal tract the mucin layer can vary from 700 μm deep in the stomach to 150-300 μm deep in the small intestine [2]. Pathogens of the gastrointestinal tract, such as Clostridium perfringens, must find ways to subvert or otherwise challenge this protective mucosal barrier in order to establish infection.
C. perfringens' niche environment is the gut of animals, including humans, where it may reside harmlessly; however, infection with a pathogenic strain can cause gastroenteritis and, in serious cases, substantial intestinal tissue destruction associated with necrotic enteritis. Among the enzymes that C. perfringens employs to cope with the mucosal surface are the glycoside hydrolases, which have varying catalytic specificities that reflect the diversity in host glycans; these include, but are not limited to, neuraminidases (GH33) [3,4], exo- and endo-β-N-acetylglucosaminidases (GH84 and GH85) [5,6,7], an endo-α-N-acetylgalactosaminidase (GH101) [8,9], as well as CpGH89, which is an exo-α-N-acetylglucosaminidase [10,11]. Due to the significant genome content of genes encoding carbohydrate-active enzymes with known or suspected specificity for complex glycans, such as those found on the mucosal surface, it has been postulated that these enzymes play an important role during colonization and/or infection. Indeed, enzymatic preparations of C. perfringens, in combination with mild acid hydrolysis, have previously been used to help partially "untangle" the complex carbohydrate surface lining the gut, supporting the concept that the structure of gastrointestinal mucosa can be influenced by these bacterial factors [12].
Within the gastric mucosa there are two types of mucous cells, surface mucous cells and the deeper gland mucous cells, producing two different mucins which combine to form a stratified surface mucous layer [13]. Class III mucins are produced normally by the gastric gland mucous cells, duodenal Brunner's gland mucous cells, and the mucous cells of the accessory glands of the pancreaticobiliary tract, but also in certain tissues exhibiting gastric metaplasia or adenocarcinoma [14-23]. The class III mucins, discharged by gland mucous cells in the gastric pits [13], are somewhat distinct in that they are specifically decorated with peripheral α-GlcNAc (α-N-acetyl-D-glucosamine) residues forming GlcNAc-α-1,4-Gal-β-R (N-acetyl-β-D-glucosamine-α-1,4-D-galactose) motifs [19,22,24]. The biological relevance of this carbohydrate motif is at present not clear; however, terminal α-linked GlcNAc has been implicated as a host defense mechanism against colonization of the gastric mucosa by Helicobacter pylori [25] by blocking production of CGL (cholesteryl-α-D-glucopyranoside), an important component of this bacterium's cell wall.
C. perfringens is unusual in its ability to process the GlcNAc-α-1,4-Gal motifs found in class III mucin. CpGH89 (EC 3.2.1.50, CPF_0859), also referred to as AgnC [11], is a family 89 α-N-acetyl-D-glucosaminidase that has been shown to specifically release terminal α-linked GlcNAc from the disaccharide GlcNAc-α-1,4-Gal and demonstrated to liberate GlcNAc from crude class III porcine gastric mucin [10,11]. Using a cpgh89 mutant of C. perfringens, the activity of CpGH89 has been linked to the ability of C. perfringens to grow on mucin bearing this rare carbohydrate motif [11].
Two remarkable features of CpGH89 are its overall size (2095 amino acids) and its extensive multimodularity. Overall, the enzyme comprises a glycoside hydrolase family 89 (GH89) catalytic module, four FIVAR (found in various molecular architectures) modules, an unknown module, a C-terminal fibronectin type III-like (FN3-like) module, and six putative carbohydrate-binding modules (CBMs) (Figure 1). CBMs are generally defined as non-catalytic modules that bind carbohydrates and are found within the modular architectures of carbohydrate-active enzymes [26], thus distinguishing these modules from lectins and carbohydrate-specific antibodies. CBMs are presently classified into over 60 families based on amino acid sequence; the CBMs from CpGH89 all belong to CBM family 32, which is one of the most diverse CBM families [7].
Based on truncation studies of the enzyme and structural analyses of the N-terminal modules, the catalytic activity of the enzyme allowing it to release GlcNAc from class III mucin is attributed to its GH89 module [10,11]. Similar truncation studies that focused solely on CBM32s 2 to 6 revealed one or more of these CBMs to be able to bind mucin [11]. Notably, constructs of CpGH89 lacking the three most C-terminal CBMs had reduced activity on mucin, suggesting an important role for the CBMs in substrate recognition. Thus, CpGH89 possesses a complex multimodular architecture where the composite modules function together to efficiently act on components of mucin. Though it is clear that the CBMs are able to bind mucin, what remains unknown is which carbohydrate motifs displayed on mucin, particularly the unique GlcNAc-α-1,4-Gal motif, may be recognized by the CBM32s and what the molecular bases of these interactions are. Here we address these questions through structural and functional analyses of the CBMs from CpGH89. Overall, these studies reveal the specificity of three of the CBM32s and, through X-ray crystal structures, how two of the CBMs accommodate their ligands, which includes the first GlcNAc-α-1,4-Gal binding specificity for a protein other than an antibody.

Analysis of a galactose binding CBM
Of the six putative CBM32s in CpGH89, CBM32-5, the fifth CBM, has the highest similarity with modules known to have carbohydrate-binding function (~43% amino acid sequence identity with the CBM32 from the large sialidase NanJ, also from C. perfringens). Furthermore, the strict conservation of residues involved in galactose recognition suggested that CBM32-5 belongs to the galactose binding group of family 32 CBMs [7,27,28]. CBM32-5 was initially screened for carbohydrate binding on glycan microarrays. Binding was generally quite weak; however, two galactose terminating N-glycans, one tri-antennary and the other tetra-antennary, gave significant binding signal (Figure 2A). Likewise, two glycans terminating with GalNAc, one α-1,4-linked and the other β-1,3-linked, also gave good signals. Though this did not conclusively identify a single carbohydrate ligand, it is generally consistent with predictions of galactose specificity based on amino acid sequence similarity. This suggested binding to terminal galactose and GalNAc residues, which was used as a guide to quantitatively assess binding to carbohydrate ligands.
The addition of galactose or GalNAc to CBM32-5 perturbed the UV absorption of this protein in a manner consistent with the involvement of tyrosine residues in carbohydrate binding [29] (Figure 2B). This signal was used in a quantitative manner to assess binding to a variety of carbohydrate ligands (Figure 2C and Table 1). The association constants of CBM32-5 binding to ligands containing galactose or GalNAc were in the range of 2-5 × 10³ M⁻¹ (Figure 2B, 2C and Table 1), and thus quite weak, but of the same magnitude observed for other family 32 CBMs [3,27,30,31,32]. The CBM displayed little to no preference for either galactose or GalNAc and did not appear to significantly favor common disaccharide motifs that terminate in galactose or GalNAc over the monosaccharides (Table 1).
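To give a feel for what association constants of this magnitude mean in a titration, the following is a minimal sketch of a one-site binding isotherm. It is not the fitting procedure used in the study (that is described in [27]); it assumes a single independent site and approximates free ligand by total ligand, which is only reasonable when ligand is in large excess over protein. The function name and the example ligand concentrations are illustrative.

```python
def fraction_bound(ka_per_molar: float, ligand_molar: float) -> float:
    """Fractional saturation of one binding site.

    Assumes a simple 1:1 model with free ligand approximated by total
    ligand: theta = Ka[L] / (1 + Ka[L]).
    """
    if ka_per_molar < 0 or ligand_molar < 0:
        raise ValueError("Ka and ligand concentration must be non-negative")
    x = ka_per_molar * ligand_molar
    return x / (1.0 + x)


if __name__ == "__main__":
    # Ka values spanning the range reported for CBM32-5 (2-5 x 10^3 M^-1);
    # the ligand concentrations are hypothetical illustration points.
    for ka in (2e3, 5e3):
        for ligand_mm in (0.1, 0.5, 2.0):
            theta = fraction_bound(ka, ligand_mm * 1e-3)
            print(f"Ka={ka:.0e} M^-1  [L]={ligand_mm} mM  bound={theta:.2f}")
```

With association constants in the 10³ M⁻¹ range, half-saturation requires ligand concentrations of roughly 0.2 to 0.5 mM, which is why such interactions are described as weak.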
The structural basis for what appears to be a general selectivity for terminal galactose residues was examined by determining the X-ray crystal structure of CBM32-5 in complex with carbohydrate. The 1.55 Å resolution structure of the CBM binding galactose revealed the β-sandwich fold with a structural metal ion, in this case modeled as Ca²⁺, which is common to the family (Figure 3A). The galactose residue was well-ordered in the crystal structure, providing clear electron density (Figure 3B). The site accommodating this carbohydrate is a shallow cleft marked by two solvent exposed aromatic side chains, F1483 and Y1395 (Figure 3C), and is located in the loops at the edges of the β-sandwich (Figure 3A). The C6-OH group of galactose fits into a corner of the binding site made up by F1483 and Y1395, whose aromatic rings are at nearly right angles to one another (Figure 3C and 3D). A series of hydrogen bonds involve the side chains of four amino acids in the carbohydrate-binding site (Figure 3D). With the exception of E1376, which makes hydrogen bonds with the C3 hydroxyl group of galactose, all of the interactions are highly conserved with other known galactose binding CBMs (Figure 3E). Indeed, the interactions made by the five residues H1392, Y1395, R1423, N1428, and F1483 make up the canonical galactose-binding motif in the family 32 CBMs [7,27,32]. CBM32-5, therefore, possesses a galactose-binding site; however, it is also capable of binding GalNAc equally well. Furthermore, the analysis of the CBM32 from NagJ indicated that the recognition of longer glycans by CBM32s can involve additional subsites [27].
The structures of CBM32-5 in complex with other potentially biologically relevant ligands, GalNAc, the Tn-antigen, and GalNAc-β-1,3-Gal (Figure 4A, 4B and 4C), show the recognition of terminal GalNAc residues to be identical to that of galactose, with the addition of a water mediated hydrogen bond involving the acetamido group of the carbohydrate and the backbone nitrogens of K1427 and N1428 (Figure 4D). This limited additional interaction appears to provide little to no favorable energy to binding. Likewise, the galactose of the GalNAc-β-1,3-Gal extended away from the protein surface and made no interactions with the protein, which is consistent with the lack of improved binding for this disaccharide over GalNAc. The same observation was made for the serinyl group of the Tn-antigen, even though the serine is α-linked to GalNAc. Modeling other common α-linked carbohydrates, such as Gal-α-1,3-Gal, based on the Tn-antigen complex suggested that these additional residues also extend out into solvent with no capacity to make additional interactions with the protein (not shown).
The crystallography results suggest that CBM32-5 is relatively promiscuous in that it requires only a terminal galactose or GalNAc residue with little preference for the sugar that precedes it. The glycan microarray results, however, suggested a strong interaction with a unique carbohydrate, GalNAc-α-1,4(Fuc-α-1,2)-Gal-β-1,4-GlcNAc. This interaction was reproducible on glycan microarrays, even when using CBM that was directly labeled by chemically coupling the fluorophore to primary amines on the CBM (not shown). To our knowledge, this glycan has not been identified in any mammalian tissues; however, this synthetic carbohydrate was clearly the top ligand from the array analysis, suggesting that an analysis of the interaction of CBM32-5 with this carbohydrate may provide insight into the recognition of more complex but as yet unstudied glycans. A molecular dynamics approach was used to study the potential interaction of GalNAc-α-1,4(Fuc-α-1,2)-Gal-β-1,4-GlcNAc-OMe with CBM32-5. The resulting analysis gave an ensemble of ten structures with each structure representing a group of similar, energy-minimized structures (Figure 4E). Overall, the carbohydrate in the ten structures adopts an array of potential conformations, though the terminal GalNAc residue and the preceding Gal residue are somewhat constrained in their positions. A representative of the lowest energy group of models shows the carbohydrate to adopt a conformation that, by virtue of the bent conformation imparted by the α-1,4-linkage between the GalNAc and Gal, bends around Y1395 and allows the reducing-end portion of the glycan to rest against the protein surface with only a very small number of additional hydrogen bonds made (Figure 4F). Free energy decomposition shows that the increased affinity of this ligand for CBM32-5 results from the increased van der Waals and non-polar solvation interactions that are imparted by the complementary interacting surface areas of this unique carbohydrate ligand and the CBM surface. This interaction is specifically enhanced by interactions between the fucosyl residue and residues Y1395 and N1396 of CBM32-5 (Figure S1).
Figure 4. (A) [...]-weighted 2Fobs-Fcalc maps contoured at 1 σ (both maps at 0.31 e⁻/Å³) produced by refinements prior to modeling the sugar (green) and with the sugar included (blue). (B) Electron density for serinyl-Tn antigen shown as maximum-likelihood/σA [59]-weighted 2Fobs-Fcalc maps contoured at 1 σ (both maps at 0.45 e⁻/Å³) produced by refinements prior to modeling the sugar (green) and with the sugar included (blue). (C) Electron density for GalNAc-β-1,3-galactose shown as maximum-likelihood/σA [59]-weighted 2Fobs-Fcalc maps contoured at 0.8 σ (both maps at 0.34 e⁻/Å³) produced by refinements prior to modeling the sugar (green) and with the sugar included (blue). (D) Divergent stereo view of the key interactions between the binding site of CBM32-5 and GalNAc. This also represents the mode of interaction between the CBM and the serinyl-Tn antigen and GalNAc-β-1,3-galactose, which all have identical hydrogen bonding patterns. Hydrogen bonds are shown as dashed black lines. (E) Models of CBM32-5 in complex with GalNAc-α-1,4(Fuc-α-1,2)-Gal-β-1,4-GlcNAc-OMe produced by molecular dynamics simulations. An ensemble of ten energy minimized models is given with each model representing a group of energetically similar models. Relevant residues in the binding site are shown as grey sticks with the backbone of the protein shown as a Cα-ribbon. (F) A surface representation of the lowest energy model of CBM32-5 bound to the tetrasaccharide (GalNAc is shown in green, fucose in pink, galactose in yellow, and GlcNAc in blue). The surfaces contributed by Y1395 and F1483 are shown in magenta and additional hydrogen bonds made outside of the primary galactose binding site are shown as dashed lines. doi:10.1371/journal.pone.0033524.g004
Though GalNAc-α-1,4(Fuc-α-1,2)-Gal-β-1,4-GlcNAc may not be a biologically relevant ligand for CBM32-5, its mode of interaction with this CBM suggests that other high affinity ligands, perhaps not represented on the carbohydrate microarrays, may be possible provided they adopt a conformation that maximizes the interacting surface areas.

Carbohydrate-binding modules with unique specificity
Though CpGH89 has at least one functional CBM, its specificity (i.e. galactose and GalNAc) is clearly mismatched with the specificity of the catalytic module. Furthermore, this CBM is an outlier among the CpGH89 CBMs as it has higher amino acid sequence identity with CBMs from other enzymes than it does with the remaining CBMs from CpGH89. In contrast, CBM32-2, CBM32-3, and CBM32-4 form a distinct cluster in the phylogenetic analysis of the CBM32 family [7]. Indeed, CBM32-3 and CBM32-4 share 63% amino acid sequence identity and CBM32-2 has ~30% amino acid identity with these two CBMs (Figure 5). These putative CBMs have very low amino acid sequence identity with CBM32-5 and other known CBM32s, suggesting they may represent a new functional class of CBM32s. Isolated CBM32-2, CBM32-3, and CBM32-4 were screened for binding on the glycan microarrays. CBM32-3 gave statistically meaningful binding (i.e. signal with standard errors of the mean that indicated significant binding above background) with the top hits terminating in GlcNAc-α-1,4-Gal (Figure 6A). Unfortunately, the results for CBM32-2 and CBM32-4 were inconclusive; however, the high amino acid sequence similarity between CBM32-3 and CBM32-4 suggested that both CBMs may have the same ligand, GlcNAc-α-1,4-Gal. Indeed, using ITC, the association constant of CBM32-4 for GlcNAc-α-1,4-Gal was determined to be 1.38 (±0.08) × 10⁴ M⁻¹, thus showing this to be a relatively strong interaction for a family 32 CBM (Figure 6B). The titration of GlcNAc-α-1,4-Gal into CBM32-3 also produced a binding isotherm consistent with carbohydrate binding and the association constant was determined to be 2.64 (±0.64) × 10⁴ M⁻¹ (Figure 6C). Thus, both CBM32-3 and CBM32-4 appear to have binding specificity for GlcNAc-α-1,4-Gal, which is complementary to the specificity of the catalytic module.
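For readers more used to dissociation constants or free energies, the association constants quoted above can be converted with the standard relations Kd = 1/Ka and ΔG° = -RT ln Ka. The sketch below is illustrative only; it assumes 25 °C (the temperature stated for the ITC experiments) and a 1 M standard state, and the function names are not taken from the study.

```python
import math

R_J_PER_MOL_K = 8.314462618  # molar gas constant
T_KELVIN = 298.15            # 25 degrees C, as stated for the ITC experiments


def kd_micromolar(ka_per_molar: float) -> float:
    """Dissociation constant in micromolar from Ka in M^-1 (Kd = 1/Ka)."""
    return 1e6 / ka_per_molar


def delta_g_kj_per_mol(ka_per_molar: float, t_kelvin: float = T_KELVIN) -> float:
    """Standard binding free energy, dG = -RT ln(Ka), in kJ/mol."""
    return -R_J_PER_MOL_K * t_kelvin * math.log(ka_per_molar) / 1000.0


if __name__ == "__main__":
    # Ka values reported in the text for the GlcNAc-alpha-1,4-Gal disaccharide.
    reported = {"CBM32-4": 1.38e4, "CBM32-3": 2.64e4}
    for name, ka in reported.items():
        print(f"{name}: Kd = {kd_micromolar(ka):.0f} uM, "
              f"dG = {delta_g_kj_per_mol(ka):.1f} kJ/mol")
```

On this scale the two GlcNAc-α-1,4-Gal binding modules have dissociation constants in the tens of micromolar, roughly an order of magnitude tighter than the galactose and GalNAc interactions of CBM32-5.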
The ability of CBM32-3 and CBM32-4 to bind GlcNAc-α-1,4-Gal is unique among non-catalytic carbohydrate binding proteins, prompting the study of the molecular basis of this interaction. Of the two CBMs, crystals were only obtained of CBM32-4. The structure of seleno-methionine labeled CBM32-4 was determined by single anomalous dispersion to 1.55 Å resolution. This CBM adopts a β-sandwich fold with a conserved structural metal ion, modeled as a calcium atom, which is similar to that of CBM32-5 (root mean square deviation of 1.9 Å over 112 matched Cα) (Figure 7A). CBM32-4 was co-crystallized with GlcNAc-α-1,4-Gal and this structure determined to 2.8 Å resolution (Figure 7B). Both molecules of CBM32-4 in the asymmetric unit had bound disaccharide as revealed by clear electron density for the sugar located in the loops at the edges of the β-sandwich core (Figure 7B, C, and D). CBM32-4 accommodates the disaccharide in a shallow depression; the sugar, with its bent conformation, lies on edge in the depression with the B-face of the galactose residue pushed up against the planar surface of the W1333 side chain. Though there are no aromatic residues present on the adjacent wall of the binding site, it is at roughly right angles to the plane of the W1333 side chain and thus well positioned to pack against the A-face of the GlcNAc residue. Markedly few hydrogen bonds are made between the sugar and binding site, suggesting that binding and specificity for this disaccharide are driven primarily by hydrophobic and van der Waals forces and accommodation of the unique carbohydrate conformation. O1 of the galactose is completely exposed and oriented out into the bulk solvent, illustrating how the CBM might tolerate extensions on the reducing end of the GlcNAc-α-1,4-Gal motif, which is consistent with binding to the glycan microarrays and with the recognition of the motif as it would naturally be displayed at the termini of glycans on mucin.
The O3 and O4 groups on the terminal GlcNAc, though solvent exposed, lie very close to the protein surface. It is unclear whether modification of these groups could be tolerated by the CBM, thereby allowing it to recognize internal GlcNAc-α-1,4-Gal motifs, but the proximity to the protein surface and the steric clashes that would likely ensue suggest that this is unlikely. The C6 hydroxyl group is buried in the base of the binding site and thus extension with additional sugar residues would not be tolerated.
CBM32-3 was recalcitrant to crystallization, preventing structural analysis by X-ray crystallography and direct examination of its interaction with carbohydrate; however, the main residues involved in GlcNAc-α-1,4-Gal recognition by CBM32-4 are conserved in CBM32-3 (Figure 5). Taking further advantage of the high amino acid sequence identity of the two CBMs, a homology model of CBM32-3 was constructed; this revealed not only conservation of the primary binding site residues but also the majority of the residues lining the binding site (Figure 7F), indicating that the mode of carbohydrate recognition by CBM32-3 is likely extremely similar to that of CBM32-4.
To date, the structural analysis of family 32 CBMs found in carbohydrate-active enzymes has revealed two subtypes of CBMs within the family: the 'canonical' galactose binding CBM32s, such as CBM32-5, and the unique GlcNAc binding CBM32 as represented by the CBM from NagH, NagHCBM32-2 [31]. A comparison of the amino acids involved in ligand binding from CBM32-4 with the binding sites of both of these CBM32 subtypes shows them to have no similarities in carbohydrate recognition beyond the general placement of the binding sites (Figure 7G and 7H). Thus, the GlcNAc-α-1,4-Gal binding CBMs, CBM32-3 and CBM32-4, represent a new mode of carbohydrate recognition by the CBM32s and continue to highlight the diversity within this family of CBMs.
Glycan microarray binding experiments with CBM32-2 were inconclusive, as were other low-throughput experiments to identify potential ligands, and attempts at crystallization did not yield crystals of sufficient quality for structure determination. To provide some insight into the potential capacity of this module to interact with carbohydrate a homology model based on the structure of CBM32-4 was constructed. Though the residues in CBM32-4 that impart carbohydrate binding function are not conserved with CBM32-2 ( Figure 5) the model reveals a pocket in the protein surface located in loops that usually contain the binding sites of CBM32s ( Figure 8A and 8B). This pocket contains a solvent exposed aromatic amino acid, Y1046, and a series of exposed planar polar amino acid side chains ( Figure 8B). These features are generally consistent with the properties of carbohydrate binding sites in CBMs, suggesting that this module is indeed capable of recognizing an as yet unidentified sugar.

CBM32-1 and CBM32-6 appear to lack carbohydrate-binding function
Despite the observation that CBM32-1 and CBM32-6 display only 26% amino acid identity (Figure 5), they cluster together in a phylogenetic analysis of CBM32 modules, indicating that they are more closely related to one another than to other putative CBMs [7]. Qualitative UV difference scans on CBM32-1 and CBM32-6 did not suggest binding to any simple monosaccharides (galactose, GalNAc, mannose, sialic acid, GlcNAc or glucose). CBM32-1 was also screened on glycan microarrays but significant binding was not detected. The structure of CBM32-6 was determined to 1.55 Å resolution using SAD and seleno-methionine substituted protein (not shown). This structure compared with CBM32-1, previously determined as part of a construct including the catalytic module [10], gave a root mean square deviation of 1.8 Å over 119 Cα atoms. Neither CBM32-1 nor CBM32-6 has any exposed aromatics in the region of the protein known to contain the binding sites in CBM32 proteins (Figure 9). Furthermore, a more thorough analysis of the surface residues of CBM32-1 and CBM32-6 showed them both to lack features consistent with carbohydrate binding sites. This observation, along with the lack of experimental support for carbohydrate binding, suggests that CBM32-1 and CBM32-6 do not function as CBMs, which perhaps explains their somewhat outlying position in the phylogenetic analysis of CBM32 modules [7].
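The root mean square deviations quoted for the pairwise structure comparisons summarize how far matched Cα atoms lie from one another after superposition. As a minimal sketch of the quantity itself, the function below computes the RMSD for two equal-length lists of already matched and already superposed coordinates. The superposition and atom-matching steps, and the software used for them, are not described in this section and are deliberately left out; the function name and the toy coordinates in the usage lines are hypothetical.

```python
import math
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Coord], coords_b: Sequence[Coord]) -> float:
    """RMSD in the input units between paired, pre-superposed atoms."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be the same length")
    if not coords_a:
        raise ValueError("at least one atom pair is required")
    # Sum of squared distances between each matched pair of atoms.
    total = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy coordinates (in angstroms), not taken from any deposited structure.
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    print(f"RMSD = {rmsd(a, b):.2f} A")
```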

The modular diversity of CpGH89 and its implications
In order to colonize the gastrointestinal tract, organisms must first infiltrate the mucosal surface. For example, the secreted mucosal surfaces of the colon are composed mainly of Muc2, which forms both the thick outer mucous layer, which plays host to many commensal microbes, and the thin inner mucous layer, which is impervious to bacteria [33,34]. GlcNAc-α-1,4-Gal is displayed by the deeper gastric-type mucosal class III mucins, Muc5Ac and Muc6 [19], and the catalytic activity of CpGH89 is directed at this specific carbohydrate structure. Furthermore, two of the CBMs in this enzyme, CBM32-3 and CBM32-4, have evolved binding specificity complementary to the catalytic specificity. In a manner consistent with the generally proposed role of CBMs [26], CBM32-3 and CBM32-4 likely direct the enzyme to the secreted class III mucins within the deep mucosa of the stomach and duodenum, and in doing so promote substrate degradation by the catalytic module. The presence of two CBMs with the same specificity indicates the potential for a multivalent interaction, thereby increasing the overall apparent affinity of the enzyme for regions that display clusters of the GlcNAc-α-1,4-Gal motif.
Of the six CBM32-like modules that CpGH89 possesses, two do not appear to bind carbohydrate (their functions, if they have any, remain unknown), one has putative carbohydrate-binding function (CBM32-2), and the remaining three clearly have carbohydrate-binding function (CBM32-3, CBM32-4 and CBM32-5). The specificity of CBM32-5 appears to be primarily for terminal galactose and GalNAc residues and thus does not match the substrate preference of the catalytic module. Such mismatching between CBMs and their cognate catalytic modules is not unusual with C. perfringens glycoside hydrolases [3,27]. The biological reason for the presence of the mismatched CBMs remains speculative; however, it has been postulated that the presence of such CBMs may allow the enzyme to remain adhered to carbohydrate rich surfaces after the catalytic module has begun processing the substrate. For example, after hydrolysis of the GlcNAc-α-1,4-Gal substrate by the catalytic module of CpGH89 the remaining terminal sugar is a galactose residue and thus a potential ligand of CBM32-5. There then exists the potential for multivalent interactions involving heterogeneous clusters of ligands, such as combinations of the GlcNAc-α-1,4-Gal motif and terminal galactose and GalNAc residues. Alternatively, it has been hypothesized that the majority of the C. perfringens glycoside hydrolases, including CpGH89, are either covalently or noncovalently associated with the bacterial surface [6]. Thus, though the intrinsic affinity of a single CBM32-5 module for terminal galactose residues is quite low and on its own would be unlikely to mediate significant adherence of soluble CpGH89 to terminal galactose residues, the possible context of bacterial surface association of the entire enzyme creates further potential for avid binding.
Figure 5. Amino acid sequence comparison of the CBM32 modules from CpGH89. The secondary structure is shown above (CBM32-4) and below (CBM32-5) with arrows representing β-strands and cylinders α-helices. The purple and yellow triangles above and below the sequences indicate the aromatic and hydrogen bonding residues, respectively, that are involved in carbohydrate binding by CBM32-4 (top) and CBM32-5 (bottom). Numbers with the triangles indicate the residue number. Residues in CBM32-2 that are highlighted by boxes are those present in the putative binding site of this module. doi:10.1371/journal.pone.0033524.g005
Overall, the presence of at least three functional CBMs in CpGH89, with a fourth likely, imparts diversity in the ability of this enzyme to recognize carbohydrate substructures and potential for increased affinity through multivalent interactions. As a secreted enzyme this capability would enhance the overall association of the enzyme with class III mucins. In the possible case that CpGH89 is immobilized on the bacterial cell surface, the enzyme's capacity to bind carbohydrate would impart considerable carbohydrate-adhesive capacity to the bacterium and thus promote the tight interaction of this bacterium with its host.
All of the proteins were produced recombinantly in E. coli BL21(DE3) and purified by immobilized metal affinity chromatography and size exclusion chromatography (SEC) using methodologies described in detail previously [5]. Seleno-methionine-labeled CBM32-4 and CBM32-6 were produced as above using E. coli B834 (DE3) as the expression strain (Novagen). The media containing seleno-methionine was prepared according to the instructions of the manufacturer (Athena Enzyme). Protein concentrations were determined at 280 nm using calculated extinction coefficients [35].
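Determining concentration from absorbance at 280 nm relies on the Beer-Lambert law, A = ε·c·l. The following minimal sketch shows the arithmetic; the absorbance reading, extinction coefficient, and 1 cm path length in the usage lines are hypothetical placeholders, not values from this study, and the function name is illustrative.

```python
def concentration_molar(a280: float,
                        epsilon_per_molar_cm: float,
                        path_cm: float = 1.0) -> float:
    """Protein concentration (M) from A280 via Beer-Lambert: c = A / (eps * l)."""
    if epsilon_per_molar_cm <= 0 or path_cm <= 0:
        raise ValueError("extinction coefficient and path length must be positive")
    return a280 / (epsilon_per_molar_cm * path_cm)


if __name__ == "__main__":
    # Hypothetical reading and calculated extinction coefficient (M^-1 cm^-1).
    a280 = 0.63
    epsilon = 20000.0
    conc = concentration_molar(a280, epsilon)
    print(f"concentration = {conc * 1e6:.1f} uM")
```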

Glycan microarray screening
Glycan microarray screening was performed by Core H of the Consortium for Functional Glycomics (www.functionalglycomics.org/). CBMs were labeled by coupling to Alexa Fluor® 488 labeled streptavidin via a biotin-NTA:Ni²⁺ linker using methods identical to those described previously [36]. Labeled proteins were desalted using PD-10 columns (GE Healthcare) and used to probe the printed glycan arrays according to the standard procedures of Core H of the Consortium for Functional Glycomics.

Binding studies
Qualitative UV difference scans were performed using methods identical to those described previously [31]. Quantitative UV difference titrations were also performed using methods already described [27]. The concentration of protein used for the titrations was 31.5 μM in 20 mM Tris-HCl pH 8.0. The concentrations of carbohydrate stocks used to titrate into protein varied between ~40 mM and 45 mM and were prepared by mass in 20 mM Tris-HCl pH 8.0. Experiments were performed at 25 °C in triplicate.
Isothermal titration calorimetry (ITC) was performed as described previously using a VP-ITC (MicroCal, Northampton, MA) [27]. Proteins were filtered and degassed prior to use. Carbohydrate solutions were prepared by mass in buffer saved from dialysis of the appropriate protein. These solutions were also filtered and degassed prior to use. The protein concentrations used varied from ~100 μM to ~550 μM. However, in no case could a protein concentration be used that exceeded the Kd by more than five-fold (i.e. C-values were less than 5); thus, data were fit with a single binding site model using MicroCal Origin software (version 7.0) with the stoichiometry (n-value) fixed at 1. Experiments using CBM32-5 were performed in 20 mM Tris-HCl, pH 8.0, and those with CBM32-3 and CBM32-4 in 50 mM HEPES, pH 7.5. Experiments were performed at 25 °C in triplicate.
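The C-value referred to above is the dimensionless product c = n·Ka·[P], equivalently n·[P]/Kd, where [P] is the protein concentration in the calorimeter cell. The sketch below shows the arithmetic and the protein concentration at which c reaches a chosen ceiling for a given Ka. It is an illustration under the stated one-site assumption (n = 1), not a reconstruction of the actual experimental design; the function names are hypothetical and the 100 μM example is simply the lower end of the stated concentration range.

```python
def c_value(ka_per_molar: float, protein_molar: float, n_sites: float = 1.0) -> float:
    """ITC c-value: c = n * Ka * [P]."""
    return n_sites * ka_per_molar * protein_molar


def protein_conc_at_c(ka_per_molar: float, c_target: float = 5.0,
                      n_sites: float = 1.0) -> float:
    """Protein concentration (M) at which the c-value equals c_target."""
    if ka_per_molar <= 0 or n_sites <= 0:
        raise ValueError("Ka and n must be positive")
    return c_target / (n_sites * ka_per_molar)


if __name__ == "__main__":
    # Ka reported in the text for CBM32-4 binding GlcNAc-alpha-1,4-Gal.
    ka = 1.38e4
    print(f"c at 100 uM protein: {c_value(ka, 100e-6):.2f}")
    print(f"[P] giving c = 5: {protein_conc_at_c(ka) * 1e6:.0f} uM")
```

At low c the binding isotherm is shallow and the stoichiometry is poorly determined by the data, which is the usual reason for fixing n during fitting.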

Crystallization
Prior to crystallization, CBMs generally required overnight treatment with thrombin followed by re-purification by SEC to remove the 6-histidine tag. The complex of CBM32-4 with GlcNAc-α-1,4-Gal, however, was obtained with protein still having the 6-histidine tag. All crystallization experiments were performed at 18 °C using the hanging drop vapour diffusion method.

Data collection, Structure Solution and Refinement
Diffraction data were collected at 100 K at the National Synchrotron Light Source (NSLS) beamline X8-C, the Stanford Synchrotron Radiation Laboratory (SSRL) beamline BL 9-2, or a home source comprising a Rigaku R-AXIS IV++ area detector coupled to an MM-002 X-ray generator with Osmic "blue" optics and Oxford Cryostream 700, as indicated in Tables 3 and 4. Data were processed using d*TREK or MOSFLM [37,38].
The structures of CBM32-4 and CBM32-6 were solved by single-wavelength anomalous dispersion (SAD) experiments optimized for selenium (see Table 4 for wavelengths at which SAD data were collected). The heavy atom substructures were determined from the SAD data using the program SHELXC/D, while phasing was performed using SHELXE [39]. CBM32-4 crystallized with a single molecule in the asymmetric unit (AU); three of its four potential selenium sites were found and used for phasing. CBM32-6 crystallized with two molecules in the AU, with each monomer having two potential selenium sites; only one selenium site per monomer was found and used for phasing. Density modification with the program DM [40,41] was used to improve the phases prior to model building. ARP/wARP [42] was able to build almost complete models, which were completed by manual model building with COOT [43]. Structural refinement of CBM32-6 (selenium derivative) was performed with PHENIX [44] refine using simulated annealing interspersed with manual building in COOT [43]. REFMAC [45] was used to refine CBM32-4. The structure of CBM32-4 in complex with GlcNAc-α-1,4-Gal was solved by molecular replacement using PHASER [46] to find the two molecules in the asymmetric unit. The model was completed by manual building with COOT and refinement with REFMAC; TLS parameters were included in the final refinement cycles of this structure.
The structure of CBM32-5 in complex with galactose was solved by molecular replacement using CpCBM32C from CpGH84C as a search model (PDB id 2j1e [27]) and MOLREP [47] to find the single molecule in the asymmetric unit. Automated model building was carried out with ARP/wARP followed by manual completion with COOT. This structure was used as a starting point to solve the structures of CBM32-5 in complex with other sugars. All refinements were carried out using REFMAC.
In all cases, waters were added using COOT:FINDWATERS. In all datasets 5% of the observations were flagged as "free" and used to monitor refinement progress. Final models were validated with MOLPROBITY [48]. Tables 3 and 4 show the data collection, refinement and final model validation statistics.
Table 3. X-ray data collection and model refinement statistics for CBM32-5.

Modeling the CBM32-5 tetrasaccharide complex
A 50 ns molecular dynamics (MD) simulation of the tetrasaccharide, GalNAcα1-4(Fucα1-2)Galβ1-4GlcNAcβ with a reducing terminal methyl, was performed using the pmemd module of the AMBER11 software package [49]. The GLYCAM06g [50] force field was used for the tetrasaccharide parameters while the initial geometry was obtained from the GLYCAM carbohydrate 3D structure web tool [51]. The tetrasaccharide was explicitly solvated with 1724 TIP3P waters [52] and no ions. Minimization was performed for 20,000 steps, half of which used the conjugate gradient method followed by the steepest descent method. A 10.050 ns constant pressure MD (NPT) run was used to ensure water and glycan equilibration, in which the first 50 ps were used to heat the system from 5 K to 300 K. The final frame from equilibration was used to start the 50 ns NPT production simulation of the tetrasaccharide. In all tetrasaccharide simulations an 8.0 Å van der Waals cutoff was employed, particle mesh Ewald summation (PME) [53] was used for long-range electrostatics, 1,4-scaling factors were set to unity, and a dielectric constant of 1.0 was used. The Berendsen thermostat was used with a coupling constant of 1.0 ps. Pressure was maintained at 1 atm with a relaxation time of 0.1 ps. The SHAKE [54] algorithm was used to constrain the bonds to hydrogen atoms, allowing a time step of 2 fs. Production frames were collected every picosecond and only the production run was used for further analyses.
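The run lengths, 2 fs time step and 1 ps frame interval given above fix the number of integration steps and saved frames. The following is a minimal bookkeeping sketch; the function and constant names are hypothetical and it is not an AMBER input file. It shows that the 50 ns production run yields 50,000 frames, matching the 50,000 snapshots used in the subsequent modeling.

```python
TIMESTEP_FS = 2         # time step with SHAKE applied to bonds to hydrogen
FRAME_INTERVAL_PS = 1   # production frames saved every picosecond

HEATING_PS = 50             # heating from 5 K to 300 K
EQUILIBRATION_PS = 10_050   # 10.050 ns NPT equilibration (includes heating)
PRODUCTION_PS = 50_000      # 50 ns NPT production run


def steps_for(duration_ps: int, timestep_fs: int = TIMESTEP_FS) -> int:
    """Number of integration steps needed to cover duration_ps."""
    total_fs = duration_ps * 1000
    if total_fs % timestep_fs:
        raise ValueError("duration is not a whole number of time steps")
    return total_fs // timestep_fs


def frames_for(duration_ps: int, interval_ps: int = FRAME_INTERVAL_PS) -> int:
    """Number of frames saved when writing one frame every interval_ps."""
    return duration_ps // interval_ps


if __name__ == "__main__":
    print("heating steps:      ", steps_for(HEATING_PS))
    print("equilibration steps:", steps_for(EQUILIBRATION_PS))
    print("production steps:   ", steps_for(PRODUCTION_PS))
    print("production frames:  ", frames_for(PRODUCTION_PS))
```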

The crystal structure of GalNAc-β-Serine bound to CBM32-5 was used as a template for modeling the tetrasaccharide onto the complex. The GalNAc-β from the template crystal structure and the non-reducing terminal GalNAc-β from the MD simulation were aligned on the ring atoms (C1, C2, C3, C4, C5 and O5) using the alignment algorithm in VMD [55]. The MD trajectory of the entire tetrasaccharide was then combined with the template protein coordinates, resulting in 50,000 snapshots of the solution tetrasaccharide bound to the crystal protein coordinates. Clashes were removed using a 2,000 step minimization, half conjugate gradient and half steepest descent, for each of the 50,000 complexes, where the FF99SB force field [56] was used for the protein. The modified Onufriev, Bashford and Case generalized Born implicit solvent model [57] was used to approximate solvent effects in minimization. All minimizations in developing the CBM-tetrasaccharide complexes used mixed 1,4-scaling, which set van der Waals and electrostatic scaling factors to 1.2 and 2.0, respectively, for the protein (consistent with FF99SB) and unity for the tetrasaccharide (consistent with GLYCAM06). Additionally, a 12.0 Å long-range van der Waals cutoff was employed, with PME being used for long-range electrostatics.
Table 4. X-ray data collection and model refinement statistics for CBM32-4 and CBM32-6.
The final net energy (including GB solvation contributions) of the CBM-tetrasaccharide complex was used to identify complexes within 15 kcal/mol of the lowest energy complex. This resulted in the selection of 42 complexes, which were further minimized using 10,000 steps of conjugate gradient and 10,000 steps of steepest descent minimization. These new complexes were then ranked according to their overall system energy and grouped together using a 1.0 Å cutoff in root mean square deviation (RMSD) of the heavy atoms. The models were grouped such that reference structures were selected starting from the lowest energy and ending at the highest energy models. Structures already grouped into lower energy clusters were excluded from subsequent RMSD grouping, so that any single structure could belong to only one group. Ten clusters were identified, in which 60% of the complexes were in the two lowest energy groupings, with 33% in the lowest energy group. Energy decomposition was performed on these ten clusters using the MMGBSA.py application in AMBER using the same implicit solvent model as in the minimizations.
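The energy-window selection and energy-ordered RMSD grouping described above can be sketched as follows. This is a minimal illustration, not the scripts used in this work: the function names are hypothetical, the RMSD is computed without superposition (an assumption, reasonable only because all models share the same protein frame), and the toy coordinates and energies are invented for the example.

```python
import math


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) tuples, no fitting."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate sets must be non-empty and equal in length")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def within_window(energies, window=15.0):
    """Indices of models within `window` kcal/mol of the lowest energy."""
    lowest = min(energies)
    return [i for i, e in enumerate(energies) if e - lowest <= window]


def greedy_cluster(indices, energies, coords, cutoff=1.0):
    """Group models around references taken in order of increasing energy.

    Each model joins the first (lowest energy) reference within `cutoff`
    and is then removed, so it can belong to only one group.
    """
    unassigned = sorted(indices, key=lambda i: energies[i])
    clusters = []
    while unassigned:
        ref = unassigned[0]
        members = [i for i in unassigned
                   if rmsd(coords[ref], coords[i]) <= cutoff]
        clusters.append(members)
        unassigned = [i for i in unassigned if i not in members]
    return clusters


if __name__ == "__main__":
    # Toy data: five two-atom "models" with invented energies (kcal/mol).
    energies = [-100.0, -98.0, -95.0, -99.0, -80.0]
    coords = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.5), (1.0, 0.0, 0.5)],
        [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
        [(3.0, 0.0, 0.2), (4.0, 0.0, 0.2)],
        [(9.0, 0.0, 0.0), (9.5, 0.0, 0.0)],
    ]
    selected = within_window(energies)
    print(greedy_cluster(selected, energies, coords))
```

Because references are taken in energy order, the grouping depends on the ranking as well as on the cutoff; a different ordering could assign borderline models to different groups.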

Homology modeling of CBM32-3 and CBM32-2
Structural models of CBM32-3 and CBM32-2 were prepared using the one-to-one threading function of the Phyre2 server [58]. In both cases, the 1.55 Å resolution structure of CBM32-4 was used as a template.