EcoBLMcrX, a classical modification-dependent restriction enzyme in Escherichia coli B: Characterization in vivo and in vitro with a new approach to cleavage site determination

Here we characterize the modification-dependent restriction enzyme (MDE) EcoBLMcrX in vivo, in vitro and in its genomic environment. MDE cleavage of modified DNAs protects prokaryote populations from lethal infection by bacteriophage with highly modified DNA, and also stabilizes lineages by reducing gene import when sparse modification occurs in the wrong context. The function and distribution of MDE families are thus important. Here we describe the properties of EcoBLMcrX, an enzyme of the E. coli B lineage, in vivo and in vitro. Restriction in vivo and the genome location of its gene, ecoBLmcrX, were determined during construction and sequencing of a B/K-12 hybrid, ER2566. In classical restriction literature, this B system was named r6 or rglAB. Like many genome defense functions, ecoBLmcrX is found within a genomic island, where gene content is variable among natural E. coli isolates. In vitro, EcoBLMcrX was compared with two related enzymes, BceYI and NhoI. All three degrade fully cytosine-modified phage DNA, as expected for EcoBLMcrX from classical T4 genetic data. A new method of characterizing MDE specificity was developed to better understand action on fully-modified targets such as the phage that provide major evolutionary pressure for MDE maintenance. These enzymes also cleave plasmids with m5C in particular motifs, consistent with a role in lineage-stabilization. The recognition sites were characterized using a site-ranking approach that allows visualization of preferred cleavage sites when fully-modified substrates are digested. A technical constraint on the method is that ligation of one-nucleotide 5' extensions favors G:C over A:T approximately five-fold. Taking this bias into account, we conclude that EcoBLMcrX can cleave 3' to the modified base in the motif Rm5C|. This is compatible with, but less specific than, the site reported by others. Highly-modified site contexts, such as those found in base-substituted virulent phages, are strongly preferred.


Methodology
The program WebLogo 3 [1,2] was used to identify and visualize overrepresented sequence motifs at the sites of cleavage, that is, the sites at which the adaptors had ligated most often.
For this analysis, the data set was too large to analyze in its entirety.
Instead, we considered only favored cleavage sites. The number of reads (X) that align to a particular genomic location was defined as the digestion frequency of that location. The top N favored loci were analyzed with WebLogo. N was derived from the number of potential sites (Y) in the substrate and the total number of reads (Z). If the reads were randomly distributed, they would end at any given site with a frequency of Z/Y. A locus was considered high-frequency (and included in N) if X > Z/Y.
Sequence logos represent the information content at a particular position, that is, how conserved or how highly specified it is. The overall height of each stack is proportional to the sequence conservation, defined as the difference between the maximum possible entropy and the entropy of the observed symbol distribution, at that position. The maximum sequence conservation per site is 2 bits for DNA and RNA (log2 4), reached when all the alignments contain a specific base at a given position. If each of the four bases is equally represented at that position, the height is zero bits. The height of each symbol (A, T, G, C for DNA) is proportional to the observed frequency of the corresponding nucleotide. If two bases are equally represented, the sequence conservation (overall stack height) is 1 bit, and both bases appear as 0.5 bits in the logo. When generating the sequence logo, the frequencies of digestion events at a particular base sequence context (normalized to the sequence occurrence in the reference sequence, Y/Z) also contribute to the strength of the signal (i.e., the height of the stack).
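The selection rule and the stack-height definition above can be illustrated with a minimal Python sketch. It is not the pipeline used for the analysis: the function names and the dictionary layout (locus mapped to read count) are assumptions made for illustration, and the sketch omits any small-sample correction that a logo program may apply.

```python
import math
from collections import Counter


def high_frequency_loci(read_counts, n_potential_sites):
    """Keep loci whose read count X exceeds the random expectation Z/Y.

    read_counts: hypothetical dict mapping a locus to its read count X.
    n_potential_sites: Y, the number of potential sites in the substrate.
    """
    total_reads = sum(read_counts.values())      # Z
    expected = total_reads / n_potential_sites   # Z/Y
    return {locus: x for locus, x in read_counts.items() if x > expected}


def stack_height_bits(column):
    """Sequence conservation at one position: log2(4) minus Shannon entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(4) - entropy


def symbol_heights(column):
    """Height of each base: its observed frequency times the stack height."""
    total = len(column)
    height = stack_height_bits(column)
    return {base: (c / total) * height for base, c in Counter(column).items()}
```

For a column in which two bases are equally represented (for example "AAGG"), `stack_height_bits` gives 1 bit and `symbol_heights` assigns 0.5 bits to each base, matching the description above.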
Coordinates in this discussion refer to positions in a 5-base sequence, with the convention of Figure 1 of the main text in parentheses. Figure A and logo figures adhere to the convention of Figure 1 of the main text.

Results and Discussion
(1) When C is methylated in all contexts, as in XP12, the restriction sequence of all three enzymes is [AG]CN[AG]C; the 1st (-2) and 4th (+2) bases can be either G or A.
(2) When C is methylated only in the GC context, e.g., pBR322.M.CviPI (GmC) and pBR322fnu4HIM (GmCNGC), the 1st (-2) position is always G for all three enzymes, suggesting that the cytosine at position 2 (-1) is required to be methylated.
(4) All three enzymes seem to have more relaxed recognition on hydroxymethylated substrates (or more star activity). This could reflect the ability to accept the methyl group of T in a low-GC environment.
(5) The observed enrichment of G/C at the 3rd (+1) base is probably due to ligation bias: when the single-base overhang is G or C, it pairs more efficiently than an A-T pair.
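As an illustration of the [AG]CN[AG]C pattern in point (1), the following Python sketch scans a plain DNA string for matching 5-base windows. It assumes a fully substituted substrate, so that every C in the string stands for a modified cytosine; the function name is hypothetical and the sketch does not model modification state or strand.

```python
import re

# [AG]CN[AG]C written as a regular expression; the lookahead lets
# overlapping windows be reported as separate sites.
MOTIF = re.compile(r"(?=([AG]C[ACGT][AG]C))")


def find_motif_sites(seq):
    """Return (0-based start, matched 5-mer) for each [AG]CN[AG]C window."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq.upper())]
```

For example, `find_motif_sites("TTGCAGCTT")` reports the window GCAGC starting at index 2.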

NOTE: The sequence logos below were generated from all the restriction events at high-frequency digestion loci as described above, so that the sequences of preferred loci received more weight in logo generation.
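One way to picture this weighting is to let every read at a locus contribute one copy of that locus's sequence context to the per-position base counts. The Python sketch below does this under assumed inputs: `locus_contexts` (locus mapped to its 5-base context) and `read_counts` (locus mapped to its read count) are hypothetical names, not part of the original analysis.

```python
from collections import Counter


def weighted_position_counts(locus_contexts, read_counts, width=5):
    """Per-position base counts in which each locus counts once per read."""
    counts = [Counter() for _ in range(width)]
    for locus, context in locus_contexts.items():
        weight = read_counts.get(locus, 0)  # number of restriction events
        for i, base in enumerate(context[:width]):
            counts[i][base] += weight
    return counts
```

A locus with three reads therefore influences the column frequencies three times as much as a locus with one read.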