Regulatory context drives conservation of glycine riboswitch aptamers

In comparison to protein coding sequences, the impact of mutation and natural selection on the sequence and function of non-coding (ncRNA) genes is not well understood. Many ncRNA genes are narrowly distributed to only a few organisms, and appear to be rapidly evolving. Compared to protein coding sequences, there are many challenges associated with assessment of ncRNAs that are not well addressed by conventional phylogenetic approaches, including: short sequence length, lack of primary sequence conservation, and the importance of secondary structure for biological function. Riboswitches are structured ncRNAs that directly interact with small molecules to regulate gene expression in bacteria. They typically consist of a ligand-binding domain (aptamer) whose folding changes drive changes in gene expression. The glycine riboswitch is among the most well-studied due to the widespread occurrence of a tandem aptamer arrangement (tandem), wherein two homologous aptamers interact with glycine and each other to regulate gene expression. However, a significant proportion of glycine riboswitches are comprised of single aptamers (singleton). Here we use graph clustering to circumvent the limitations of traditional phylogenetic analysis when studying the relationship between the tandem and singleton glycine aptamers. Graph clustering enables a broader range of pairwise comparison measures to be used to assess aptamer similarity. Using this approach, we show that one aptamer of the tandem glycine riboswitch pair is typically much more highly conserved, and that which aptamer is conserved depends on the regulated gene. Furthermore, our analysis also reveals that singleton aptamers are more similar to either the first or second tandem aptamer, again based on the regulated gene. Taken together, our findings suggest that tandem glycine riboswitches degrade into functional singletons, with the regulated gene(s) dictating which glycine-binding aptamer is conserved. Author Summary The glycine riboswitch is a ncRNA responsible for the regulation of several distinct gene sets in bacteria that is found with either one (singleton) or two (tandem) aptamers, each of which directly senses glycine. Which aptamer is more important for gene-regulation, and the functional difference between tandem and singleton aptamers, are long-standing questions in the riboswitch field. Like many biologically functional RNAs, glycine aptamers require a specific 3D folded conformation. Thus, they have low primary sequence similarity across distantly related homologs, and large changes in sequence length that make creation and analysis of accurate multiple sequence alignments challenging. To better understand the relationship between tandem and singleton aptamers, we used a graph clustering approach that allows us to compare the similarity of aptamers using metrics that measure both sequence and structure similarity. Our investigation reveals that in tandem glycine riboswitches, one aptamer is more highly conserved than the other, and which aptamer is conserved depends on what gene(s) are regulated. Moreover, we find that many singleton glycine riboswitches likely originate from tandem riboswitches in which the ligand-binding site of the non-conserved aptamer has degraded over time.


Abstract: 24
In comparison to protein coding sequences, the impact of mutation and natural selection 25 on the sequence and function of non-coding (ncRNA) genes is not well understood. 26 Many ncRNA genes are narrowly distributed to only a few organisms, and appear to be 27 rapidly evolving. Compared to protein coding sequences, there are many challenges 28 associated with assessment of ncRNAs that are not well addressed by conventional 29 phylogenetic approaches, including: short sequence length, lack of primary sequence 30 conservation, and the importance of secondary structure for biological function. 31 Riboswitches are structured ncRNAs that directly interact with small molecules to 32 regulate gene expression in bacteria. They typically consist of a ligand-binding domain 33 (aptamer) whose folding changes drive changes in gene expression. The glycine 34 riboswitch is among the most well-studied due to the widespread occurrence of a 35 tandem aptamer arrangement (tandem), wherein two homologous aptamers interact 36 with glycine and each other to regulate gene expression. However, a significant 37 proportion of glycine riboswitches are comprised of single aptamers (singleton). Here 38 we use graph clustering to circumvent the limitations of traditional phylogenetic analysis 39 when studying the relationship between the tandem and singleton glycine aptamers. 40 Graph clustering enables a broader range of pairwise comparison measures to be used 41 to assess aptamer similarity. Using this approach, we show that one aptamer of the 42 tandem glycine riboswitch pair is typically much more highly conserved, and that which 43 aptamer is conserved depends on the regulated gene. Furthermore, our analysis also 44 reveals that singleton aptamers are more similar to either the first or second tandem 45 aptamer, again based on the regulated gene. Taken together, our findings suggest that 46 4 (e.g. RNA structural similarity, sequence similarity, or some combination of both), to 93 assess the relative similarity of related RNAs. A graph-based clustering method was 94 recently utilized to cluster contigs belonging to the same transcripts in de novo 95 transcriptome analysis (11), but it lacks the structural component necessary for accurate 96 clustering of structured ncRNA sequences. Other approaches for detection and 97 clustering of structured RNAs provide robust methods for researchers to classify 98 members of existing RNA families (12-15) and identify novel structured ncRNA (16)(17)(18)(19)(20). 99 However, these tools are fundamentally not designed to compare degrees of 100 conservation across clusters of homologous structured RNAs. 101 To investigate motif divergence an analysis approach that provides fine-scale 102 comparison of similarity within and between groups of homologous RNAs is necessary. 103 Our approach allows any measure of pairwise similarity between two RNA sequences to 104 be utilized for comparison and relative conservation between groups to be investigated. 105 Thus, measures that incorporate purely secondary structure information (21,22), purely 106 sequence information (23,24), or a combination of both (25-29), may be utilized so that 107 all available information can be captured. While this approach does not yield the same 108 kind of inferences concerning the line of descent as traditional phylogenetic approaches, 109 it does enable many more variables to be accounted for while assessing ncRNA similarity. 110 Regulatory RNAs have quickly become the largest class of functional RNAs. Yet, 111 our understanding of their evolution lags that of the larger catalytic RNAs such as the 112 ribosome. One class of regulatory RNAs that, similarly to the ribosome, take their function 113 from a three-dimensional structure are riboswitches. Riboswitches are bacterial cis-114 regulatory elements that occur within the mRNA 5'-UTR and alter transcription attenuation 115 7 and gene expression, while the first aptamer (aptamer-1) primarily played a role in 162 structural stabilization and aptamer dimerization (39). However, in vivo investigation of a 163 tandem glycine riboswitch within Bacillus subtilis found that disruption of aptamer-1's 164 binding pocket impeded riboswitch regulation more strongly than disrupting aptamer-2's 165 (40). To resolve the differences observed between the V. cholerae and B. subtilis tandem 166 riboswitches, we conducted a comprehensive sequence analysis of glycine aptamers. 167 To identify glycine riboswitch aptamers, we used the RFAM covariance model 168 one containing all aptamer-1's (first aptamer) and one containing all aptamer-2's (second 203 aptamer). We then generated a phylogenetic tree for each set to determine whether the 204 phylogenetic divergence seen within the riboswitch set is explained by variances within 205 one specific aptamer or is present in both (Figure 1B, C). Both aptamer-1 and aptamer-206 2 sets display clear clustering based on genomic context. This indicates that divergence 207 of tandem glycine riboswitches in differing genomic contexts cannot be fully explained by 208 variation within the first or second aptamer alone. Moreover, within the GCV context, it 209 appears that aptamer-1 is more highly conserved than aptamer-2, as indicated by the 210 shorter branch lengths across the clade (Figure 1B, C). 211 212 Genomic context dictates aptamer clustering in Bacillaceae and Vibrionaceae. 213 To better understand how the homologous aptamers of the tandem glycine riboswitch 214 have diverged, we broadened our taxonomic scope and focused our investigation onto 215 the individual aptamer domains of the glycine riboswitch. However, the shorter sequence 216 length of the individual glycine aptamers confounded our analysis. Thus, relative 217 conservation of aptamer-1 to aptamer-2 in different genomic contexts was investigated 218 using graph clustering of all Bacillaceae and Vibrionaceae aptamers within a given 219 genomic context, excluding identical aptamer pairs coming from different strains of the 220 same species. This set comprised of 84 pairs of aptamer-1 and aptamer-2 from 221 Bacillaceae regulating GCV and 36 pairs from Vibrionaceae regulating TP 222 Table S4, S5). The number of TP riboswitches was reduced by one in 223 this analysis compared to the previous, as one of the riboswitches was no longer unique 224 within the set when evaluating only the individual aptamer sequences. 225

(Supplementary
We generated networks comprised of vertices corresponding to individual glycine 226 riboswitch aptamers with edges weighted based on the pairwise RNAmountAlign distance 227 score (29). RNAmountAlign was chosen as the primary metric for edge-weighting in our 228 work due to its implemented use of primary sequence information and ensemble 229 mountain distance of secondary structure to generate a pairwise score more quickly and 230 efficiently than other software. After weighting with RNAmountAlign, edges were trimmed 231 if they were below a selected RNAmountAlign threshold, thus altering the topology of the 232 network from completely pairwise to containing clusters of aptamers whose similarity is 233 greater than the threshold. Thresholding was done across a range of RNAmountAlign 234 scores to identify conserved aptamer groups which retained tight clustering (Figure 2A). 235 Each network corresponds to a specific genomic context, TP or GCV. We find that within 236 these contexts, aptamers group based on their position within the tandem arrangement 237 (aptamer-1 vs. aptamer-2). Network density, defined as the fraction of edges present 238 within a group compared to the total number of edges in the non-thresholded network, 239 was calculated across a range of RNAmountAlign score thresholds and used to gauge 240 relative conservation of each aptamer type for each genomic context ( Figure 2B). 241 Differing cluster densities between aptamer types revealed that genomic context effects 242 which aptamer is more highly conserved: aptamer-1 is more highly conserved in 243 riboswitches regulating the GCV, while aptamer-2 is more highly conserved in those 244 regulating TP. A Wilcoxon rank-sum analysis of all intra-group edges was performed as 245 well to validate these findings ( Figure 2C). We obtain very similar findings using a variety To assess the relationship between singleton and tandem riboswitches, we implemented 252 graph clustering of individual aptamers from both tandem and singleton glycine 253 tandem derived aptamers. The first contains singleton type-1 aptamers and aptamer-1 of 277 tandem riboswitches, all regulating GCV. The second includes singleton type-2 aptamers 278 and aptamer-2 of tandem riboswitches, all regulating TP ( Figure 3B). 279 Members of both mixed communities were extracted and networks were generated 280 for each as described above (Figure 3C). For aptamers originally part of a tandem 281 arrangement, the paired aptamer was included to assess relative conservation 282 Table S8, S9) of the singlet aptamers to each tandem aptamer type. 283

(Supplementary
We determined relative conservation between aptamer types by calculating the network 284 density of edges connecting each aptamer type (inter-edge density) ( Figure 3D) across 285 a range of RNAmountAlign thresholds. We observe that singleton type-1 aptamers 286 regulating GCV are most similar to aptamer-1 of tandem riboswitches in the same context 287 and conversely that singleton type-2 aptamers regulating TP are most similar to aptamer-288 2 of tandem riboswitches in the same context. The inter-edge density between singleton 289 type-1 aptamers and tandem aptamer-1's regulating GCV is comparable to that seen 290 between singleton type-2 aptamers and tandem aptamer-2's regulating TP (Figure 3D). 291 These two groupings also represent the highest conservation across aptamer types within 292 their networks, with other pairings being comparable to inter-edge density measurements 293 with a random set of 40 aptamers (Supplementary Table S10). Using Dynalign, 294 FoldAlign, Clustal Omega, and RNApdist as distance metrics yields similar findings 295 (Supplementary Figure S3, S4). 296 To further investigate the similarities between these aptamers, we generated 297 consensus structures of the riboswitches found within each genomic context using a 298 combination of LocARNA, Infernal, and R2R (13,50,51). Consensus structures of 299 13 riboswitches regulating GCV show tandem aptamer-1 and singleton type-1 aptamers 300 have high conservation of the P2 and P3 stems, as well as the binding pocket, while the 301 P1 stem of tandem aptamer-2 shows high conservation with the singleton type-1 ghost 302 aptamer ( Figure 4A, B). This conservation of the ghost aptamer P1 stem correlates with 303 the region required for tertiary interactions of tandem and singleton riboswitches (38). 304 This is observed within riboswitches regulating TP as well, except the aptamer of 305 singleton type-2 and tandem aptamer-2 are the conserved aptamers ( Figure 4C, D). 306 Together, our results indicate three things about Bacilli glycine riboswitch aptamers 307 within each genomic context: 1) one tandem aptamer shows high conservation to the 308 singleton aptamer, 2) conservation between the alternative tandem aptamer and 309 singleton aptamers is no greater than conservation of the singleton to a random set of 310 glycine riboswitch aptamers, and 3) ghost aptamer location correlates with the less 311 conserved tandem aptamer. This fits with a model wherein these singleton riboswitches 312 are the result of tandem riboswitch degradation, and which aptamer to be conserved and 313 which to be degraded is dependent on genomic context. If the situation was reversed and 314 tandems were the result of duplication events of singleton riboswitches, we would expect 315 higher conservation between the singleton aptamer and both tandem aptamers compared 316 to a random set of glycine riboswitch aptamers. However, we only observe such 317 conservation with one tandem aptamer in each genomic context. 318 319 Actinobacteria riboswitches display similar clustering pattern observed in Bacilli. 320 To determine whether these patterns are observed within other clades of bacteria, 321 we gathered all glycine aptamers within our dataset in the Actinobacteria phylum 322 (distantly related to both the Vibrio and Bacilli classes analyzed previously), excluding 323 identical aptamers coming from different strains of the same species, totaling 606 324 Table S11). We then evaluated all aptamers within the set in 325 the same manner as our Bacilli investigation. Within this phylum, glycine riboswitches 326 primarily regulate GCV (74%) or other genes involved in glycine metabolism (22%). We 327 identified a group of 34 conserved aptamers corresponding to riboswitches regulating 328 GCV ( Figure 5A) and utilized de novo community detection algorithms to validate our 329 observation ( Figure 5B). The aptamers within this group comprised primarily singleton 330 type-1 aptamers and tandem aptamer-1 sequences, with five singleton type-0 and two 331 singleton type-2 aptamer sequences accounting for the remainder. The singleton type-2 332 aptamers within the set may be misclassified aptamers or examples of singleton aptamers 333 which do not conform to the patterns observed for other investigated aptamers. We 334 performed graph clustering on the group, with paired tandem aptamer-2s included as an 335 out-group, to investigate conservation of aptamer types ( Figure 5C) (Supplementary 336 Table S12). We then calculated the edge densities within and between singleton type-1 337 aptamers, tandem aptamer-1s, and tandem aptamer-2s, which demonstrate a clear 338 conservation between singleton type-1 aptamers and tandem aptamer-1s ( Figure 5D). 339

aptamers (Supplementary
These findings fit our conclusions drawn from the Bacilli class of bacteria. Using Dynalign, 340 FoldAlign, Clustal Omega, and RNApdist as distance metrics yield similar findings 341 (Supplementary Figure S5). 342 343 Clustering based on genomic context is observed throughout entire bacterial 344

kingdom. 345
To determine whether clustering patterns observed within the Bacilli class and 347 Actinobacteria phylum are reflected throughout the rest of the bacterial kingdom and can 348 be observed among randomly selected aptamers, we randomly selected 150 distinct 349 glycine riboswitch aptamers each for the GCV and TP genomic context (Supplementary 350 Table S13, S14). Our selection retained comparable numbers of each aptamer type and 351 excluded singleton type-0 aptamers. Singleton type-1 and type-2 aptamers are 352 underrepresented in the TP and GCV regulating sets, respectively, because each 353 aptamer type has few instances within that genomic context. Despite the diverse 354 taxonomic range represented within this dataset, the generated networks display 355 clustering patterns which align with our previous observations: a tendency towards 356 clustering of singleton type-1 aptamers with tandem aptamer-1s when regulating GCV, 357 and clustering of singleton type-2 aptamers with tandem aptamer-2s when regulating TP 358 Taking this analysis a step farther using graph clustering, we were able to 382 determine that genomic context dictates which aptamer within a tandem glycine 383 riboswitch is more highly conserved: aptamer-1 is more highly conserved in riboswitches 384 regulating GCV, while aptamer-2 is more highly conserved in those regulating TP. These 385 findings provide an elegant answer to a contradiction within the field in which 386 investigations of diverse glycine riboswitch homologs yielded different results for whether 387 ligand-binding of the first or second aptamer is more important for functionality (39,40). 388 Our results align with both studies' findings: the aptamer identified as the essential binding 389 partner for regulation in each study is the aptamer found in our study to be more highly 390 conserved within that genomic context. With the results of these previous studies 391 combined with this new perspective provided by our data, it is reasonable to conclude 392 that a difference in genomic context has driven glycine riboswitches to conserve different 393 primary ligand-binding aptamers. 394 Our observation that tandem glycine riboswitch evolution is affected by genomic 395 context led us to question the impact of genomic context on singleton glycine 396 riboswitches. We extended our network analysis to singleton riboswitches, which 397 provided valuable insight into the relationship of tandem and singleton glycine 398 riboswitches. Clustering of singleton and tandem aptamers from the Bacilli and 399 Actinobacteria clades revealed that singleton aptamers are more similar to the first or 400 second tandem aptamer based on genomic context: singleton type-1 aptamers regulating 401 GCV are more similar to aptamer-1 of tandems regulating GCV, while singleton type-2 402 aptamers regulating TP are more similar to aptamer-2 of tandems regulating TP. This 403 similarity of singletons to one tandem aptamer within a genomic context is highlighted by 404 the fact that the singleton aptamers show no higher similarity with the other tandem 405 aptamer than with a random set of 40 glycine riboswitch aptamers. This is observed within 406 both GCV and TP regulating riboswitches and directs us towards the conclusion that 407 singleton riboswitches are the remnants of degraded tandem aptamers.

Riboswitch Phylogenetic Analysis 462
Tandem riboswitches were grouped based on taxonomic origin and genomic context. In 463 order to incorporate secondary structure information, groups were aligned using 464 LocARNA's mlocarna function for de novo alignment and folding (50,68,69) and Infernal's 465 cmalign function to align to the tandem covariance model (12,13). Maximum likelihood 466 phylogenetic trees were generated from these aligned groups using RAxML (9). Trees

Consensus Structure Generation 496
We generated Stockholm files for sets of riboswitches using a combination of LocARNAs 497 mlocarna function and alignment to our covariance models using Infernal's cmsearch 498 function. R2R was then used to generate consensus structures based on these 499 Stockholm files (51). The "#=GF R2R SetDrawingParam autoBreakPairs true" flag was 500 used to allow for breaking of base pairs in instances where aptamer stems were not highly 501