A study of the structural properties of sites modified by the O-linked 6-N-acetylglucosamine transferase

Protein O-GlcNAcylation (O-GlcNAc) is an essential post-translational modification (PTM) in higher eukaryotes. The O-linked β-N-acetylglucosamine transferase (OGT), targets specific Serines and Threonines (S/T) in intracellular proteins. However, unlike phosphorylation, fewer than 25% of known O-GlcNAc sites match a clear sequence pattern. Accordingly, the three-dimensional structures of O-GlcNAc sites were characterised to investigate the role of structure in molecular recognition. From 1,584 O-GlcNAc sites in 620 proteins, 143 were mapped to protein structures determined by X-ray crystallography. The modified S/T were 1.7 times more likely to be annotated in the REM465 field which defines missing residues in a protein structure, while 7 O-GlcNAc sites were solvent inaccessible and unlikely to be targeted by OGT. 132 sites with complete backbone atoms clustered into 10 groups, but these were indistinguishable from clusters from unmodified S/T. This suggests there is no prevalent three-dimensional motif for OGT recognition. Predicted features from the 620 proteins were compared to unmodified S/T in O-GlcNAcylated proteins and globular proteins. The Jpred4 predicted secondary structure shows that modified S/T were more likely to be coils. 5/6 methods to predict intrinsic disorder indicated O-GlcNAcylated S/T to be significantly more disordered than unmodified S/T. Although the analysis did not find a pattern in the site three-dimensional structure, it revealed the residues around the modification site are likely to be disordered and suggests a potential role of secondary structure elements in OGT site recognition.


Introduction
Protein O-GlcNAcylation, or O-GlcNAc, is a dynamic, intracellular glycosylation essential to mammalian development [1,2]. In animals, two enzymes mediate this post-translational modification: the glycosyltransferase O-linked 6-N-acetylglucosamine transferase (OGT), which adds a single, non-extensible O-GlcNAc moiety to serine/threonine (S/T) in the target protein; and the hexosaminidase O-GlcNAcase (OGA) that removes it. UDP-GlcNAc, the sugar donor to the protein O-GlcNAcylation, is a product of the hexosamine pathway, hence the concentration of intracellular glucose and the degree of protein O-GlcNAcylation levels are associated [3,4]. At the physiological level, dysfunction of OGT activity has been linked to disease of the PLOS ONE | https://doi.org/10.1371/journal.pone.0184405 September 8, 2017 1 / 14 a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 cardiovascular system, diabetes, impaired development, cancer and neurodegeneration [5][6][7][8][9]. At the cellular level, protein O-GlcNAcylation acts with phosphorylation, ubiquitylation and other reversible post-translational modifications in a network of cell signalling events that promote cellular adaptation to the viral infection process [10], regulation of transcription [11] and metabolism [12,13]. Technical advances in mass spectrometry have led to an increase in the number of experimentally determined O-GlcNAc sites from 50 in the year 2000 to more than 1,000 today [14]. However, there are still obstacles to mapping O-GlcNAc sites reliably. The modification has a low abundance [15] and is ten times less common than protein phosphorylation [16]. Thus, the unmodified version of the peptide can suppress the O-GlcNAcylated peptide mass/charge signal. In addition, methods to enrich O-GlcNAcylated peptides in samples have limited specificity [16,17], and the β-glycosidic bond is labile under the peptide fragmentation step which determines the modification's position within the peptide fragment.
Two machine learning methods have been used to detect patterns in the sequence of O-GlcNAc sites [18][19][20] with limited success [21]. Newer predictors have exploited more complex machine learning approaches to classify potential novel sites [22][23][24] but to date have only been applied in a few studies. One of the limiting factors for site prediction is that, unlike phosphorylation sites, O-GlcNAc sites lack a clear pattern in the primary structure. This is illustrated in Fig 1 which compares the relative sequence entropy for sites modified by OGT and three protein kinases in the PhosphoSitePlus database [14]. The relative sequence entropy, calculated with the WebLogo library [25], describes the amount of information carried per position compared to the background amino acid distribution. OGT sites shows no peaks other than the modified S/T, in contrast to protein kinase A (PKA; peak in -3 and +2), protein kinase C (PKC; peak in -3) and casein kinase 2 (CK2, peak in +3) sites. This implies that the sequence in the sites recognised by OGT carries less information than those recognised by PKA, PKS or CK2 and so are harder to distinguish from unmodified sites by sequence alone.
OGT activity measured on peptide libraries demonstrate the enzyme substrate specificity and that point mutations near to the targeted S/T abolish peptide modification [26][27][28]. The crystal structure of OGT in a ternary complex with UDP-GlcNAc and a peptide substrate revealed that the OGT and the peptides' residues predominantly make contact via the peptide backbone [29,30]. This fact reduces the importance of the peptide side chain in the enzyme active site, the cleft where the reaction occurs. A short structural motif, instead of sequence motif, could work as a point of molecular recognition even with a degenerate sequence. Accordingly, in this paper, the three-dimensional structures of S/T OGT substrates were examined to determine if they have distinct structural motifs and patterns of secondary structure or solvent accessibility. In addition, the predicted secondary structure and disorder were compared for known OGT substrates and S/T unlikely to be modified.

Data sources
The data selection process is summarised in Table 1 and Fig 2. A total of 1,533 modified sites from 676 proteins were selected by combining proteins curated from the literature up until 2011 [18] and from 2011-2013 [21]. The majority of the sites were obtained from highthroughput experiments in mammalians. The sites were filtered to keep 7-residue long motifs with unique sequences. The resulting dataset contained 1,385 sites in 620 proteins. This dataset is referred to hereafter as the "modified sequence sites" (MSS). For comparison, 100,329 S/T from the same proteins, but not annotated as OGT-modified, were selected as a background and are referred to here as the "unmodified sequence sites" (USS). S1 Table [

Mapping O-GlcNAc sites to protein structures
Protein chains > 30 residues long from structures determined by X-ray crystallography to 2.50 Å resolution were selected from the Protein Data Bank [31] (PDB: 2 nd August of 2015). Mapping the 1,385 OGT sites from 620 proteins to PDB structures by SIFTS [32] located 45 OGT sites in 24 proteins of known structure. The structures of a further 107 sites were identified by searching the sequences of O-GlcNAcylated proteins against the PDB chains with BLAST and filtering by a conservative E-value 10 −25 to minimise erroneous matches. The cutoff of 10 −25 was found empirically to ensure the reliability of the match in the region of each site by inspecting all alignments between query and PDB sequence at different thresholds. Three kinases with most sites in PhosphoSitePlus database [14] protein kinase A (PKA with 1285 sites), protein kinase C (PKC with 930 sites) and casein kinase 2 (CK2 with 742 sites). 1530 OGT sites were compiled from the same database. The sequence relative entropy was calculated with the WebLogo library [25]. Lines show mean relative entropy and the semi-transparent area represents 95% confidence intervals.

Site definition and clustering
The three-dimensional structure of OGT with its substrates suggests the region of contact between OGT and a modifiable S/T includes the residues and +/-3 amino acids either side [29,30]. From the structural sites returned in Mapping O-GlcNAc-sites to protein structures, "132 Structural Sites" (hereafter SS132) had at least one match with all backbone atoms for the 7-residue long site and were retained for further analysis. Cα atoms of each residue and the Cα and the Cβ for the central S/T were superimposed for all pairs of sites. Hierarchical clustering by complete linkage was applied on the resulting matrix of root-mean-square deviation (RMSD) values and clusters selected where all pairs of peptides were within 3 Å RMSD of each other.

Structural properties of sites
Protein secondary structure assignments were obtained from DSSP [33]. DSSP annotates 7 different secondary structure states: 3 10 helix (G), α helix (H), π helix (I), bends (S), turns (T), isolated (B) and extended (E) β-bridge. These assignments were reduced to three states: G and H to helices (H); I, B and E to strands (E); and all other, including residues with no assignment, to coils (C) [34]. The solvent accessible area from DSSP was normalised by the residue's maximum accessible area [35]. A S/T was considered exposed if its relative solvent accessibility (RSA) was > 25%; partially buried if the RSA > 5% and 25%, and buried if RSA 5%. Cα B-factors were standardised (Z-score normalised) over the B-factors for all Cα in the same chain.

Prediction of protein disorder and secondary structure
Protein secondary structure predictions for the proteins in the MSS dataset were performed by JPred4 [36]. Since JPred4 limits sequence longer than 800 residues, 300 of the sequences in the MSS dataset sequences were trimmed while ensuring the modified S/T was at least 100 residues away from the N-and C-termini to avoid edge effects. The intrinsic disorder was predicted by JRonn (Java implementation of Ronn [37]), IUPred [38] and DisEMBL [39] through the JABAWS [40] command line application. Between them, these methods provide 6 different disorder prediction scores: DisEMBL-REM465 (0.6), DisEMBL-COILS (0.516), DisEMBL-HOTLOOPS (0.1204), IUPred-Long (0.5), IUPred-Short (0.5) and JRonn (0.5). The score ordered/disordered classes were defined by the cut-offs (in parenthesis) defined by the methods' authors. Disorder predictions were also performed on a background set of 1,164 S/T selected at random from globular proteins in the Astral dataset [41] version 2.04, referred to hereafter as the "Globular Set" (GS).

Statistical analysis and code
The data collection, processing, analysis and the Cα clustering steps, were written in the Python programming language (Python Software Foundation, version 2.7 http://www.python. org) with the libraries Pandas (version 0.17) [42] and Biopython (version 1.65) [43]. Statistical tests were performed with the StatsModels (version 0.6) and Scipy (version 0.16) libraries. A p value (p) threshold was set to 0.05.

Results and discussion
Analysis of O-GlcNAc sites in proteins of known three-dimensional structure Previous reports have suggested that O-GlcNAc sites, like phosphorylation sites, are predominantly present in disordered regions of proteins [44]. One indication of structural disorder is the crystallographic B-factor which indicates regions of the protein that lack crystallographic contacts. However, the standardised B-factor distribution on the SS143 dataset is the same for modified and unmodified S/T (Kruskal-Wallis two-sample test p = 0.12). In X-ray crystal structures, the REM465 residue annotation indicates residues that are missing from the protein structure model and has previously been used as an indicator of structural disorder [39]. Of the 143 S/T in the SS143 dataset, 26 are in regions of the protein structure labelled as REM465. In comparison, 553 of 4,811 unmodified S/T from the same protein structures are also found in REM465 regions. Accordingly, O-GlcNAcylated S/T in these proteins are 1.7 times more likely to be in REM465 regions (Fisher's exact test p = 0.02). This finding is consistent with O-GlcNAcylated S/T occurring more frequently in disordered or highly flexible regions. Table 2 summarises the DSSP assigned secondary structure for the SS143 compared to the 4,811 unmodified S/T in the same proteins. The proportions of H, E and C are equivalent for the two groups implying that there is no preference in the secondary structure for modified S/ T in this dataset.
Residues that are buried in the protein structure are not thought to be targeted by protein kinases, due to structural constraints. Fig 3 illustrates that there is no difference between modified and unmodified S/T with respect to relative solvent accessibility (RSA). 45% of 65 S/T in the O-GlcNAcylated proteins are exposed to solvent (RSA > 25%). Surprisingly, 7 O-GlcNAc sites, listed in Table 2, have an RSA < 5%, suggesting they are inaccessible to OGT in the natively folded protein.

Groups of sites with similar local structure
Since the secondary structure and relative accessibility of modified S/T were indistinguishable from unmodified S/T, the local structure of the 7 residue peptides centred on S/T was investigated by pairwise superposition and clustering (see Methods). 36 sites produce singlet clusters, where the majority of the residues are in C, while the remaining 96 sites fall into 10 clusters. Sites in clusters had less than 3 Å RMSD from each other. Fig 4 illustrates the superimposed structures for sites in clusters, where green, yellow and grey represent residues in H, E, C secondary structures, respectively. The clusters show that sites are found in a wide range of secondary structure states as summarised in S1 Table. The sites in Clusters E, G and J, have consistent consensus secondary structures. Clusters A-D, F, H and I are all variants on coilhelix or coil-strand transitions.
The buried sites, which are listed in Table 3, group in clusters D and G. The 3 sites in cluster D are unlikely to be targeted by OGT because they are buried in the protein core. In contrast, the 2/4 sites in cluster G (structures 3abm and 4y7y) might be modified since are located at a dimer interface, and so the monomer could be modified. The remaining two sites in cluster G (structures 2zxe and 4l3j) lie on a loop that could potentially move to expose them to OGT.
To see if the clusters found for the SS132 dataset are features of O-GlcNAc modification or just reflect the composition of the protein structures, 132 sites, centred on unmodified S/T, were randomly sampled with replacement from the same proteins and clustered. The process was repeated 1,000 times and the resulting clusters compared to those clusters in the SS132 dataset. The number of clusters identified in each sample ranged from 10-14 (95% CI), which is consistent with the SS132 dataset. Furthermore, the structural clusters identified for the random sampling included structural clusters similar to those for the modified sites, suggesting there are no dominant secondary structural or conformational patterns indicative of O-GlcNAc modified sites in the SS132 dataset. The analysis was also extended longer peptides with 20 residues either side of the modified S/T, but the structural clustering showed high heterogeneity for 41-residue peptides and no clear patterns were identified. Analysis of features predicted for the "modified sequence sites" dataset (MSS) Since the structural analysis of O-GlcNAc sites is limited by the number of sites in proteins of known three-dimensional structure, prediction algorithms were applied to the sequences in the MSS and USS datasets, as detailed in Methods. The proportions of S/T in the levels of solvent accessibility predicted by JPred are equivalent in the MSS and USS datasets, as shown in Table 4. 1% of the S/T are predicted to be buried in the MSS and USS datasets. Again, the result is unexpected, since sites modified by PTM are thought to be accessible in the protein native fold. While the structural sites in the SS143 dataset have equal proportions of the secondary structure states, the result from secondary structure predictions on the MSS set showed that O-GlcNAc sites are likely to reside in coils, if compared to the USS dataset. Table 5 shows an increase of the proportion of modified S/T in C (p < 0.01) and a corresponding reduction in H (p < 0.01), but no change in E (p = 0.6). The enrichment of sites in C is consistent with the need to place modified S/T in loops that are more likely to be mobile and so more accessible to OGT. The proportions of secondary structure assigned by DSSP and predicted by JPred4 differ. While secondary structure prediction has limited accuracy, the number of samples in the SS143 dataset is limited and potentially biased toward structured regions in proteins. Also, clustering sites in the SS132 dataset highlight groups that are more likely to occur near to the transition between a secondary structure element and C, as observed in several members of clusters A-D, F and H. The regions of transition between C and H/E are RSA of modified S/T in the SS143 dataset and unmodified S/T in same proteins. DSSP calculated solvent accessibility was normalised by the residue theoretical maximum accessibility and the derived scores were reduced to three levels: buried (RSA 0.05), partially buried (0.05 < RSA 0.25) and exposed (RSA > 0.25) levels. The y-axis and x-axis carry the RSA levels and the RSA distribution for each level, respectively. The mean RSA is equivalent between modified and unmodified residues, at all three levels.   Structural characterisation of O-GlcNAc sites harder to predict than contiguous secondary structure elements, and this may also contribute to the observed enrichment in C. The analysis of SS143 dataset showed an enrichment of S/T in REM465 regions likely to be disordered or highly mobile. To explore this further, 3 disorder prediction algorithms, giving a total of 6 disorder scores, were run on the MSS and USS datasets as detailed in Methods. Table 6 shows that, with the exception of DisEMBL-HOTLOOPS which is trained structural B-factors, all methods report a small but significant increase in mean predicted disorder for the modified S/T. To confirm this result, the MSS dataset was compared to the GS dataset, which was selected from proteins known to be predominantly globular, and hence an ordered background. In Fig 5, DisEMBL-HOTLOOPS shows an increase in the ratio of disordered residues around the modified S/T. DisEMBL-COILS and JRonn also indicate a small increase, not in a specific region, but rather for 40 residues around the S/T. IUpred-Long, IUPred-Short and DisEMBL-REM465 show a bigger increase of the ratio of disordered residues in the MSS dataset and IUpred-Short and REM465 have a clearer peak within -15 to 15 residues from the modified S/T. Overall, all methods indicate an increased proportion of predicted disorder in the MSS dataset when compared to the GS dataset.

Conclusions and final remarks
Despite the substantial evidence of protein structural disorder in the MSS and the SS143 datasets, the SS132 dataset clearly indicates that some of the examined sites appear within ordered regions of the protein structure. Furthermore, InterproScan [45] analysis of O-GlcNAc sites assigned 19% of the sites to protein domains, this is similar to with the 25% phosphoserines and phosphothreonines in PFAM domains [14,46], which are thought to be mostly ordered by definition. So, like protein phosphorylation, O-GlcNAcylated S/T are found in both ordered and disordered regions. The local tertiary structure of O-GlcNAc sites is indistinguishable from unmodified sites, and so how does OGT recognise the site it modifies? OGT may force the unfolding of the targeted substrate [26]. Moreover, OGT participates in macromolecular assemblies [47], and the role of adaptor proteins cannot be ignored. In protein kinase C (PKC) substrate recognition, residues distant in the protein sequence but close in its three-dimensional structure are critical [48] and non-local interactions might also act in OGT substrate recognition. Other components, such as UDP-GlcNAc concentration and subcellular location-dependent interactions, modulate OGT activity [49], but their part in substrate recognition is still unknown. In Table 6. Predicted disorder between modified and unmodified S/T. All disorder prediction methods, excepting DisEMBL-HOTLOOPS, reveal a small but significant increase of mean disorder score for modified S/T over unmodified ones.

Method
Mean score modified (MSS) ± SE Mean score unmodified (USS)± SE p value The p value refers to the two-tailed t-test between the modified and unmodified groups. SE-standard error.
https://doi.org/10.1371/journal.pone.0184405.t006 The y-axis shows the log 10 odds ratio of the between the proportion of disordered residues in the MSS dataset and the proportion of disordered residues in the GS dataset. The semitransparent area represents 95% confidence intervals. A residue was defined as disordered according to each method's threshold. The x-axis represents the distance in residues to the central residue which is always a S/T. DisEMBL-REM465, IUpred-short predict protein structural disorder specifically around the modification site, while the other methods predict intrinsic disorder over O-GlcNAcylated proteins. DisEMBL-REM465 shows a less pronounced increase in predicted disorder compared to the other methods. https://doi.org/10.1371/journal.pone.0184405.g005 Structural characterisation of O-GlcNAc sites conclusion, although no three-dimensional fingerprint was detected during the structural characterisation of OGT-modified sites, the work confirmed that S/T and surrounding residues are more disordered than the backgrounds tested and that sites in transition between C to H/E might be involved, suggesting that the structural flexibility has a role on OGT site recognition.
Supporting information S1 Table. Properties of sites in the SS132 dataset. List of all entries in the SS132 dataset. PDB, PDB accession number; Chain, chain in the PDB file; Position, residue position within the chain; Cluster, cluster id. RSA, relative solvent accessibility; SS, secondary structure. (CSV)