Genome-wide identification of 5-methylcytosine sites in bacterial genomes by high-throughput sequencing of MspJI restriction fragments

Single-molecule Real-Time (SMRT) sequencing can easily identify sites of N6-methyladenine and N4-methylcytosine within DNA sequences, but similar identification of 5-methylcytosine sites is not as straightforward. In prokaryotic DNA, methylation typically occurs within specific sequence contexts, or motifs, that are a property of the methyltransferases that “write” these epigenetic marks. We present here a straightforward, cost-effective alternative to both SMRT and bisulfite sequencing for the determination of prokaryotic 5-methylcytosine methylation motifs. The method, called MFRE-Seq, relies on excision and isolation of fully methylated fragments of predictable size using MspJI-Family Restriction Enzymes (MFREs), which depend on the presence of 5-methylcytosine for cleavage. We demonstrate that MFRE-Seq is compatible with both Illumina and Ion Torrent sequencing platforms and requires only a digestion step and simple column purification of size-selected digest fragments prior to standard library preparation procedures. We applied MFRE-Seq to numerous bacterial and archaeal genomic DNA preparations and successfully confirmed known motifs and identified novel ones. This method should be a useful complement to existing methodologies for studying prokaryotic methylomes and characterizing the contributing methyltransferases.

S7 Table shows sequence logos corresponding to all of the motifs shown in Table 5. Each logo was derived from all putative CCMD reads of the (16,16) length corresponding to the motif in question. The read length and number of reads used for each logo are shown in the table.
Sequence logos have the advantage of visualizing certain sequence features that are not readily discernible from the computationally determined motif or nucleotide distributions. For example, an SSN repeat context is apparent around the methylated CTCGAG and TGCA sites of Halorubrum. In addition, the sequence requirements of the MspJI cleavage enzyme are easily visualized in the case of the GATC motif from A. calcoaceticus, the GCGC motif of the M.HhaI clone, and the GGWCC motif of the M.AvaII clone. Finally, the lack of sequence context in the case of FspEI-cleaved A. variabilis genomic DNA shows why the CGATCG motif was not determined automatically.
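As a rough illustration of how a logo summarizes a set of equal-length reads, the sketch below computes per-position base counts and the information content (in bits) that sets the height of each logo column. It is a minimal sketch, not the procedure used to build S7 Table: the function names and the toy reads are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter

def position_counts(reads):
    """Count the bases seen at each position of a set of equal-length reads."""
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all reads must have the same length")
    return [Counter(r[i] for r in reads) for i in range(length)]

def information_content(counts, alphabet="ACGT"):
    """Per-position information in bits: log2(4) minus the Shannon entropy."""
    bits = []
    for c in counts:
        total = sum(c[b] for b in alphabet)
        if total == 0:
            bits.append(0.0)  # no usable bases at this position
            continue
        entropy = 0.0
        for b in alphabet:
            p = c[b] / total
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(math.log2(len(alphabet)) - entropy)
    return bits

# Toy reads (hypothetical): a conserved GATC core with variable flanks.
reads = ["AGATCT", "CGATCA", "TGATCG", "GGATCC"]
bits = information_content(position_counts(reads))
```

In this toy set the four core positions carry the maximum of 2 bits, while the flanking positions, where all four bases are equally frequent, carry none.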
However, logos can also be misleading, as in the case of the RCCGGY motif of MspJI-cleaved A. variabilis, in part because a logo does not account for dependencies between bases. Although the logo suggests a motif of RCHDGY, this is due to the presence of reads of other origin. S8 Table, which breaks down the representation of all possible RCHDGY sites in A. variabilis, shows that sequences conforming to RCCGGY (in red) are almost completely methylated (as determined by MFRE-Seq reads), while all other sequences are not.

S1 Table. Base filtering of selected read structures containing CCWGG (l-2 = 31) or CCWGG (l-4 = 29).
a "Mm" = MTase motif was known from previous work, but methylated base was not; "R (Mm)" = MTase motif could be inferred from that of a characterized cognate REase; "Mms" = MTase motif and specific methylated base were known from previous work and confirmed here; "-" = neither motif nor methylated base was known previously, and both were newly determined here.
S7 Table. Sequence logos corresponding to the motifs identified in Table 5, including the length and number of sequences from which each was built. (See S2 Text for further information.)
S8 Table. Analysis of the apparent RCHDGY motif in the sequence logo of A. variabilis (see S2 Text and S7 Table).
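The kind of breakdown summarized in S8 Table can be sketched as follows: expand the degenerate motif suggested by the logo into every concrete sequence, count each one in the genome, and flag those that conform to the narrower motif. This is only an illustrative sketch under stated assumptions: the helper names and the toy genome string are hypothetical, sites are counted on the given strand only, and the read-support (methylation) column of the actual table would need MFRE-Seq read data that is not modeled here.

```python
from itertools import product

# Standard IUPAC nucleotide codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expand(motif):
    """List every concrete sequence matched by a degenerate motif."""
    return ["".join(p) for p in product(*(IUPAC[ch] for ch in motif))]

def conforms(seq, motif):
    """True if a concrete sequence matches a degenerate motif."""
    return len(seq) == len(motif) and all(b in IUPAC[m] for b, m in zip(seq, motif))

def count_sites(genome, seq):
    """Count occurrences (overlaps allowed) of seq on the given strand."""
    count, start = 0, genome.find(seq)
    while start != -1:
        count += 1
        start = genome.find(seq, start + 1)
    return count

def breakdown(genome, broad, narrow):
    """Rows of (sequence, sites in genome, conforms to the narrow motif)."""
    return [(s, count_sites(genome, s), conforms(s, narrow)) for s in expand(broad)]

# Toy genome (hypothetical) with one ACCGGT site and one GCTAGC site.
genome = "TTACCGGTAAGCTAGCTT"
rows = breakdown(genome, "RCHDGY", "RCCGGY")
```

RCHDGY expands to 36 concrete sequences, of which only 4 conform to RCCGGY, which is why a logo built from mixed reads can suggest the broader pattern even when only the narrower one is methylated.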

[Table column headers: Motif; Sites in Genome]

Figure S3. Examples of read distribution with three digest cleanup protocols. All three samples were digested with MspJI and sequenced on the Ion Torrent platform. For each length, the blue bar indicates the number of unique sequences and the orange bar indicates the number of additional duplicate sequences, so the combined height indicates the number of total reads. A. One-step spin-column cleanup, which keeps all fragments, small and large; Arthrobacter sp., CCMD length = 34. B. Two-step spin-column cleanup, which selects for fragments < 100 bp; E. coli DHB4, CCMD length = 31. C. Gel purification of small fragments (20-50 bp range) from 20% polyacrylamide; E. coli DHB4, CCMD length = 31.

[Axis labels: Read Length; Number of Reads]
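The two bar components described for Figure S3 can be tallied with a short sketch like the one below, which groups reads by length and separates distinct sequences from additional duplicate copies. This is a hypothetical illustration, not the authors' pipeline: the function name and the toy reads are assumptions, and "unique" is taken to mean the number of distinct sequences at a given length.

```python
from collections import Counter, defaultdict

def length_distribution(reads):
    """Map read length -> (unique sequences, additional duplicate reads)."""
    by_length = defaultdict(Counter)
    for read in reads:
        by_length[len(read)][read] += 1
    result = {}
    for length, seqs in sorted(by_length.items()):
        total = sum(seqs.values())   # all reads of this length
        unique = len(seqs)           # distinct sequences of this length
        result[length] = (unique, total - unique)
    return result

# Toy reads (hypothetical): two copies of one 4-mer, one other 4-mer, one 2-mer.
dist = length_distribution(["ACGT", "ACGT", "ACGA", "AC"])
```

For each length, the sum of the two values is the total number of reads, which corresponds to the combined bar height described in the caption.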