Comprehensive analysis of lectin-glycan interactions reveals determinants of lectin specificity

Lectin-glycan interactions facilitate inter- and intracellular communication in many processes including protein trafficking, host-pathogen recognition, and tumorigenesis promotion. Specific recognition of glycans by lectins is also the basis for a wide range of applications in areas including glycobiology research, cancer screening, and antiviral therapeutics. To provide a better understanding of the determinants of lectin-glycan interaction specificity and support such applications, this study comprehensively investigates specificity-conferring features of all available lectin-glycan complex structures. Systematic characterization, comparison, and predictive modeling of a set of 221 complementary physicochemical and geometric features representing these interactions highlighted specificity-conferring features with potential mechanistic insight. Univariable comparative analyses with weighted Wilcoxon-Mann-Whitney tests revealed strong statistical associations between binding site features and specificity that are conserved across unrelated lectin binding sites. Multivariable modeling with random forests demonstrated the utility of these features for predicting the identity of bound glycans based on generalized patterns learned from non-homologous lectins. These analyses revealed global determinants of lectin specificity, such as sialic acid glycan recognition in deep, concave binding sites enriched for positively charged residues, in contrast to high mannose glycan recognition in fairly shallow but well-defined pockets enriched for non-polar residues. Focused fine specificity analysis of hemagglutinin interactions with human-like and avian-like glycans uncovered features representing both known and novel mutations related to shifts in influenza tropism from avian to human tissues. As the approach presented here relies on co-crystallized lectin-glycan pairs for studying specificity, it is limited in its inferences by the quantity, quality, and diversity of the structural data available. Regardless, the systematic characterization of lectin binding sites presented here provides a novel approach to studying lectin specificity and is a step towards confidently predicting new lectin-glycan interactions.


INTRODUCTION
Lectins, non-enzymatic, non-immunoglobulin, sugar-binding protein domains, selectively interact with 24 small subsets of the vast set of possible glycoforms and thereby facilitate diverse biological processes. 25 Minute differences in glycan structure can have profound impacts in associated biological processes. 26 For example, the difference between α2,3-linked and α2,6-linked terminal N-acetylneuraminic acid 27 (NeuAc) glycans serves as the primary barrier blocking avian influenza A from accessing cells in the upper 28 respiratory tract of humans, based on the specificities of the influenza hemagglutinin lectin [1; 2]. Specific 29 interactions between lectins and their cognate glycans play critical roles in many other host-pathogen 30 interactions [3] as well as an increasing number of known intracellular and extracellular biological more well-studied lectins from overly influencing studies of specificity. At each step of the analysis, 139 interaction weighting or sampling based on these clusters was applied to prevent disproportionate impact 140 from better-represented lectins in larger homology clusters. 141 To ensure a sufficient number of diverse interaction examples, only the the 12 most common unique 142 glycan ligands bound to lectins from different homology-based clusters are further considered, as well 143 as three classes of glycans likely to be specifically recognized by lectins: terminal NeuAc glycans, high 144 mannose glycans, and terminal fucose glycans ( Figure S3 & Figure S4A). The provided IUPAC identifiers 145 of the individual glycans comprising each class can be found in Table S1 -Table S3. Henceforth, the 12 146 individual glycans and three glycan classes are referred to as the 15 "glycans of interest" and displayed in 147 future figures in the order shown in Figure S4, arranged by glycan class and prevalence in complex with 148 different lectins. 149 Interactions between the lectins and glycans of interest ( Figure 1A) were represented by a compre- . For global specificity, binding interaction characteristics from each glycan of interest were compared to the background characteristics of all other lectin-glycan interactions, while for fine specificity, they were compared to those from interactions with a subgroup of very similar glycans. Descriptive and predictive statistical approaches to analyzing these features revealed those that were enriched or depleted in association with the presence of the given glycan (E). In panels A-D, the binding interaction between human lung collectin surfactant protein D and a bacterial lipopolysaccharide is used to demonstrate the three categories of interaction features (PDB ID: 4E52). Panel C has additional components illustrating featurization of the voxel point cloud via features describing the D2 distribution of pairwise distances between surface points and computed 3D Zernike descriptors (3DZDs), with the original point cloud in red and the reconstructed shape from the 3DZDs in blue. The barplot in panel E shows the mean value of selected features from each feature category for all interactions containing fucose monosaccharide (red bars) vs the averages from all other interactions (grey bars). Structures were rendered using PyMol and glycan symbols follow the Symbol Nomenclature for Glycans (SNFG) system.  Volcano plots show that a substantial proportion of features from all three categories are statistically significantly enriched (x > 0) and depleted (x < 0) in interaction characterizations for each of the 15 glycans of interest when compared to background interaction characterizations from all other glycans. It is apparent that pocket-size-correlated D2 distribution & pocket descriptor features (represented by the two lightest blue colored points) are generally enriched for larger glycan ligands (terminal NeuAc, high mannose, 3'-siayllactose) and depleted for interactions with smaller ligands (monosaccharide glycans). Some glycan-lectin interactions have fewer features that are strongly enriched (terminal fucose, N-acetyllactosamine, and TF antigen), possibly indicating a diversity of interaction mechanisms, or that more common, highly similar glycans in the background are reducing the strength of associations. Significance and direction of association was determined by weighted Wilcoxon-Mann-Whitney (WMW) tests accounting for homologous and redundant lectin structures. The x-axis shows the direction and strength of rank-based enrichment for each feature compared to background. The y-axis indicates the statistical significance (q-values) adjusted by the Benjamini-Hochberg procedure applied separately for each ligand. Q-values more significant than 1 × 10 −16 (horizontal dotted line) were scattered between 3 × 10 −19 and 1 × 10 −16 . The solid horizontal line denotes an FDR of 0.01, and the vertical line divides positive (right) and negative (left) associations. Glycan symbols follow the SNFG system. 215 While univariable comparative analysis revealed that there were indeed specific lectin binding pocket 216 features associated with specific glycan recognition, it did not (and cannot) characterize the extent to 217 which combinations of these features generalize to new cases and are thereby actually predictive of 218 which glycans a particular lectin will recognize. Thus multivariable predictive modeling, in particular 219 supervised classification, complements the univariable comparative analysis by demonstrating that in 220 some cases, particular feature combinations suffice to predict specific recognition. Here, the classification 221 goal was to train a glycan-specific model to label each lectin structure as "positive" (the glycan is actually 222 bound in the structure) or "negative" (a different glycan is bound), based on combinations of binding site 223 features learned from training data involving other, distinct interactions. Random forest (RF) classification 224 9/52 models [51] were used because of their interpretability as well as suitability for high-dimensional data 225 without detrimental impact from collinearity. RF models for each glycan of interest were validated with a 226 leave-one-out approach: binding-site structures from one of the homologous lectin clusters were withheld, 227 a model was trained on sampled dissimilar binding-site structures from the remaining lectins, and then 228 classification performance was evaluated on selected, dissimilar examples from the withheld structures. 229 We note that while a "negative" label could mean that the glycan and lectin do not interact, it could also 230 mean that, while the pair actually does interact, that interaction is solved in a different structure or the 231 structure has yet to be solved. For this reason, the prediction performance was evaluated separately for 232 both recall, the fraction of the lectin structures with that glycan bound that are correctly predicted to be 233 positive, and precision, the fraction of the lectin structures predicted to include that glycan that actually 234 do; note that these performance metrics are not impacted by the true negative rate. Additionally, models 235 were trained using F 2 scores to combine precision and recall with a greater weight on recall since recall 236 only accounts for positively labelled data.

237
The distributions of recall and precision for each glycan from repeated leave-one-out cross-validation 238 are represented by violin plots in Figure 3 and compared to the performance of corresponding "null model"

239
RFs trained and validated in the same manner but using interactions with shuffled glycan labels and thus 240 expected to display essentially random performance (shown as boxplots). The glycans' prevalences in 241 complex with different lectins were generally proportional to the amount of training data ( Figure S4B).  The RF models did very well for NeuAc terminal glycans, mannose monosaccharide, and fucose 252 monosaccharide, with all median recall values above 0.78 and median precision values above 0.71. For 253 these glycans, the associated lectin binding site features can be used to easily detect interactions, verifying 254 the value of these features in studies of specificity with some of the most predictive features shared by 255 these three models including the relative abundance of charged polar amino acids in the residues closest 256 to the glycans as well as 3D pocket features correlated with the size of the interaction site. In light of the 257 10/52 discussion above regarding positive/negative classification labels, the high precision of these models can 258 be interpreted to mean that the lectins binding these glycans are not often crystallized in complex with 259 other similar glycans, and most of the lectins capable of binding these glycans are crystallized with these 260 glycans.

261
In many cases, including lactose, galactose, N-acetylgalactosamine, 3'-sialyllactose, glucose, LacNAc, 262 and 2α-mannobiose, the glycan-specific models maintained high recall despite having lower precision, appear to be particularly predictive are discussed in the next section). However, the lower precision of 267 these models can be attributed to the same lectins appearing in complex with other glycans, particularly 268 with similar glycans. For glycans recognized by lectins that interact with other similar glycans, this effect 269 was more pronounced among the less prevalent glycans than among their more common counterparts, e.g.,

270
LacNAc had a median recall of 0.75 but also the lowest median precision (0.46), in contrast to lactose,  Figure S4).

273
Interestingly, the RF classifier for the high mannose glycan group had higher precision than recall, 274 with both outperforming the null model. While recall for the high mannose classifiers was slightly better 275 than that for the models trained on shuffled labels, median precision was the 4 th highest. This might  Random forest models trained for each of the 15 glycans have strong recall performance while predicting whether interactions contain the respective glycan based on the interaction features alone. The models are predictive of glycan identity even when trained only on lectins with less than 50% sequence identity, outperforming identical classifiers trained on data with shuffled labels. Split violin plots show the recall (left-hand distribution and left y-axis) and precision (right-hand distribution and right y-axis) of ligand-specific random forest models measured during leave-one-out cross-validation. The pairs of notched boxplots for each glycan show the performance of classifiers trained on data with shuffled labels, where again the left-hand boxplots depict recall and the right-hand boxplots depict precision. Glycan symbols follow the SNFG system. 289 Lectin interaction features that were both significant and highly predictive across diverse lectins and 290 conserved across interactions with similar glycans are likely to play a role in facilitating lectin specificity.

291
By integrating the discovered features from the comparative and predictive approaches, we thus aim to 292 obtain higher confidence in the identified features and a better basis for deriving possible explanations 293 for trends seen in the analyses.   Features that are statistically significant by the weighted WMW tests (q < 0.01) and in at least the 75 th percentile of median feature-type-stratified importance from the random forest models are indicated with bullet points. The color bars present along the columns indicated the subcategory of the feature and the parameter threshold used when extracting the feature. The color bars along the rows indicate the identity of the terminal saccharide in the glycan and the number of saccharides present. Clusters discussed include sialic acid glycans (purple boxes), mannose and glucose (cyan boxes), lactose and N-acetyllactosamine (orange boxes), and fucose and terminal fucose containing glycans (red boxes). Interestingly, N-acetylglucosamine interactions are more similar to interactions with galactose while N-acetylgalactosamine interactions are more similar to interactions with glucose. The dark green boxes indicate distinct patterns in the 3D pockets of interactions with high mannose. Glycan symbols follow the SNFG system.
(ρ = 0.31, p < 0.001), and a general trend of glucose clustering with at least one of the three 338 mannose glycans for each feature type, somewhat intuitively since mannose is a C-2 epimer of 339 glucose. In summary, this mannose-glucose clustering appeared to be driven by very similar 340 interaction pockets for 2α-mannobiose and glucose, depletion of β strands and polar residues in 341 favor of enriched non-polar residues, 3/10 helices, as well as loop structure, and a general depletion 342 of all other interaction types except for backbone hydrogen bonds. Taken together, this paints a 343 picture of mannose recognition requiring specific secondary structure arrangement to coordinate 344 backbone hydrogen bonding with primarily non-polar amino acids. relying on polar residues in close proximity to the glycan and tyrosine/aromatic residues slightly further 367 away, often found within β strands, to coordinate numerous hydrophobic contacts with the glycan. 368

15/52
Size-correlated collinear features still differentiate between large lectin pockets with different specificities. As highlighted above, the 3 sialic acid glycans of interest in this study were recog-370 nized by lectins in similar ways. However, it is remarkable that sialic acid monosaccharides had 3D pocket 371 features that were very similar to those of the other much larger sialic acid glycan ligands ( Figure 4D   , and high mannose (panel C, PDB ID: 1CVN) demonstrate the differences in the 3D interaction site space between NeuAc-binding lectins and high-mannose-binding lectins. Panel D shows the D2 distributions summarizing pocket geometry for each of these representative interactions. The lectin binding sites containing sialic acid glycans are wider and more concave while the high-mannose-accepting binding sites are more shallow and compact, being nearly entirely defined by the lowest threshold used for pocket generation as seen in the inset subpanels in A-C and in the D2 distributions in panel D. In panels A-C, residues are colored by their binned distance from the glycan (red: bin 1, orange: bin 2, sand: bin 3, pale yellow: bin 4), the glycan is colored by atom-type with carbons in white, and the rest of the lectin structure is in grey. PLIP interactions are colored blue for hydrogen bonds, pale blue for water bridges, yellow for electrostatic interactions, and grey for hydrophobic interactions. In the insets, 0.5Å 3 spheres were placed at each voxel center in the pocket and colored by the distance threshold used (magenta/red/orange/yellow: 4/6/8/10Å). In panel D, vertical lines were placed at the median D2 measure from each threshold with the same coloring as used from the insets in panels A-C. All structures were rendered in PyMol and glycan symbols follow the SNFG symbols.

17/52
their opposing non-acetylated counterparts (ρ = 0.37, p < 0.001 for Glc/GalNAc, ρ = 0.32, p < 0.001 for  To allow for more complete and continuous lectin binding sites, the binding sites were expanded from 644 PLIP-defined binding residues to include two residues on each side of an interacting residue as well as 645 residues immediately between two binding site residues. Secondary structure, backbone angles, and 646 solvent accessibility information for binding site residues were calculated from the structure files using

648
Binding site residues were binned by distance from the glycan to mitigate the high likelihood that the 649 observed interacting residues in the crystal structure are not the only residues contributing to interactions 650 stemming from the flexibility of proteins and especially glycans, the potential for the solved conformation was included in the first bin to store the number of Ca 2+ ions present within 3Å of the glycan [78].

Homology between lectins 664
To generate consistent lectin sequences from each structure, the protein sequences from each chain were values of q < 10 −16 were considered extremely significant such that further increased significance is not meaningfully interpretable, so to improve visualization in Figure 2 these values were replaced with a 695 random significance value sampled from a log-uniform distribution between 1 × 10 −16 and 3 × 10 −19 . Since, as discussed, negative labels are not always meaningful in this data (perhaps that exact structure 710 has not yet been solved) and recall and precision do not rely on the true negative rate, training performance 711 was assessed using the harmonic mean of recall and precision known as the F β score where β was set to 712 2, placing more weight on recall performance (r) compared to precision (p) as recall only uses positively 713 labelled data. To account for variation in the random sampling of test cases for each validation step of this approach, 724 the sampling and prediction was repeated 10 times at each iteration of the leave-one-out cross-validation.

725
To additionally account for variation in the training data and the stochasticity of the RFs, the leave-one-out 726 cross-validation approach was repeated 10 times for each glycan of interest. As a result, 100 samples of 727 RF performance were measured for each glycan-specific RF classifier and displayed in Figure 3. The 728 "null model" random classifier was built, trained, and tested exactly the same way except the labels for leave-one-out cross-validation was computed for features within their respective categories. Features 737 whose median stratified importance rank was at least in the 75 th percentile were considered to be highly 738 predictive.

739
Glycan recognition similarity analysis 740 To identify similar determinants of specificity in lectins recognizing different glycans, the common effect 741 size values of the features from the weighted WMW tests for each glycan of interest were correlated using 742 Pearson correlation and the glycans were hierarchically clustered with complete linkage using correlation 743 as the distance metric. Heatmaps were generated using pheatmap v1.0.12 [90]. Included supplementary figures and tables: Figure S1. Non-redundant lectin sequence extraction. Schematic of the workflow used to extract non-redundant sequences from structure files. Figure S2. Histogram of the counts of unique lectins within each homology cluster. The number of unique lectins (as defined by UniProtID) within each homology cluster generated with CD-HIT at 50% sequence identity. Most homology clusters only contained 5 or fewer unique lectins, but some very well studied lectins and homologous lectins were grouped into very large homology clusters.  Figure S5. Glycan-specific RF model training performances. Training performance of glycan-specific RF models measured with nested 5x cross-validation. Recall (left y-axis) and precision (right y-axis) of glycan-specific random forest models is shown by the split violin plots, with the left-hand distributions depicting recall and the right-hand distributions depicting precision. The pairs of notch boxplots for each glycan show the performance of the random classifiers trained on data with shuffled labels, where again the left-hand boxplots depict the random classifiers' recall and the right-hand boxplots depict their precision. Figure S6. Glycan-specific RF model training performance is correlated with the number of training examples. Training performance of glycan-specific RF models summarized by F 2 scores combining recall and precision with greater emphasis on recall, plotted against the number of samples used in training each specific model. Glycan labels are placed on the mean F 2 and sample numbers for each glycan. Training nested-cross-validation performance is fairly correlated with the number of samples available for training (Pearson correlation ρ = 0.50, p < 0.001). Figure S7. Median feature importance percentiles from glycan-specific RF models. Median feature importance percentiles from each glycan-specific RF model determined via mean decrease in Gini impurity. Size-correlated pocket features (blue) were often grouped together at a higher importance level, motivating the stratification by feature type to prevent multicollinearity from one feature type affecting other features. Points were colored with the same color scheme detailed in Figure 1, Figure 2, Figure 4, & Figure 6.

47/52
Figure S11. Feature-type stratified percentiles of RF feature importances are generally correlated with the degree of feature enrichment from the WMW test. Median ranked feature importance percentiles (stratified by feature type) are plotted together against the absolute value of the weighted WMW effect size. In general, the stronger the observed association from the WMW test for a given feature, the more likely the feature was to be highly predictive. The dotted horizontal line indicates the 75 th percentile threshold. Points that are bolded represent features that passed the 75 th percentile for feature importance and were found to signficant from the weighted WMW test at q < 0.01 following the Benjamini-Hochberg procedure. Points were colored with the same color scheme detailed in Figure 1, Figure 2, Figure 4, & Figure 6. Bullet points indicate q < 0.01 by Benjamini-Hochberg correction. The first three rows show the associations for the three defined NeuAc glycans compared to background from Figure 4 for ease of comparison.

48/52
Figure S13. Valine at position 155 in H1 appears to be a previously uncharacterized mutation associated with 6' sialoglycan specificity. Valine appears at position 155 (H1 numbering) in H1, H10, and type B hemagglutinin structures, as shown by the multiple sequence alignment (Clustal Omega), visualized with Seaview, and clustered by global sequence identity (BLOSUM 62), as show in panel A. Panel B shows the distributions of the measured minimum distance from any atom in the residue at position 155 to the closest heavy glycan atom (usually the terminal carbon of the N-acetyl group) within all HA structures from H1N1. When complexed with 3' sialoglycans, threonine is usually oriented closer to the glycan compared to valine, and has a hydrophobic contact with the sugar in two of the structures (compared to one structure when valine is present). When complexed with 6' sialyoglycans, valine is more tightly grouped closer to the glycan and has one observed hydrophobic interaction with the glycan while threonine has no contacts. Table S3. UniLectin3D-assigned IUPAC glycan names within the terminal fucose group of glycans.