Genome-Wide Methylation Patterns in Salmonella enterica Subsp. enterica Serovars

The methylation of DNA bases plays an important role in numerous biological processes including development, gene expression, and DNA replication. Salmonella is an important foodborne pathogen, and methylation in Salmonella is implicated in virulence. Using single molecule real-time (SMRT) DNA-sequencing, we sequenced and assembled the complete genomes of eleven Salmonella enterica isolates from nine different serovars, and analysed the whole-genome methylation patterns of each genome. We describe 16 distinct N6-methyladenine (m6A) methylated motifs, one N4-methylcytosine (m4C) motif, and one combined m6A-m4C motif. Eight of these motifs are novel, i.e., they have not been previously described. We also identified the methyltransferases (MTases) associated with 13 of the motifs. Some motifs are conserved across all Salmonella serovars tested, while others were found only in a subset of serovars. Eight of the nine serovars contained a unique methylated motif that was not found in any other serovar (most of these motifs were part of Type I restriction modification systems), indicating the high diversity of methylation patterns present in Salmonella.


Introduction
The methylation of DNA is important in all kingdoms of life as a mechanism of epigenetic control [1][2][3]. Methylation is achieved through the action of methyltransferase enzymes (MTases), which covalently attach methyl groups to DNA bases. In eukaryotes, 5-methylcytosine (m5C) is the most common methylation. In contrast, N 6 -methyladenine (m6A) is the most frequent methylation in prokaryotes, although N 4 -methylcytosine (m4C) and m5C are also widespread.
Methylation in eukaryotes has been well studied and is known to mediate diverse processes including growth, development, and disease [4]. In prokaryotes, methylation is a key component of restriction-modification (RM) systems, which protect cells from foreign DNA. RM systems are composed of multiple proteins, including at least one MTase, which recognizes and methylates a base contained within a specific sequence motif, and one endonuclease, or REase, which cleaves foreign DNA with a methylation pattern different from that of the host DNA. RM systems are subdivided into four main classes that differ in subunit composition, motif characteristics, cofactor requirements, and location of DNA cleavage (for review, see [5]). In brief, Type 1 RM systems are composed of two restriction subunits (R), two methylation subunits (M) and one specificity subunit (S), which recognizes specific DNA sequences. Recognized motifs are asymmetric and bipartite. Type II systems include one R and one M subunit which can function independently, and recognized motifs are mostly symmetric. Type III systems are hetero-oligomers composed of a mod subunit (recognizes and modifies DNA) and a res subunit that is only active in a mod-res complex. The only RM systems that recognize methylated, instead of unmethylated sites, are Type IV. Methylation in bacteria also influences critical processes including gene regulation, cell cycle control, pathogenicity, and DNA repair [2].
Despite the important implications of bacterial methylation, its distribution, diversity, and functional consequences have not been extensively investigated. This paucity of data can, in part, be attributed to technological limitations. Methylation studies in eukaryotes have been facilitated by the development of detection methods for m5C, including bisulfite conversion, which allows for genome-wide modification analyses. Comparable methods have not been available for the detection of m6A and m4C until recent advances in sequencing technology. SMRT sequencing couples whole-genome sequencing with the simultaneous detection of base modifications using kinetic signals during DNA polymerization [6,7]. This new technology has led to insights regarding the methylomes of several bacterial species [8][9][10][11][12]. However, methylation is widespread throughout the bacterial kingdom and is very diverse [13]. Thus, more studies are needed to gain a comprehensive understanding of the distribution and diversity of methylation motifs and their associated MTases, and ultimately to comprehend methylation functions and evolutionary history in these organisms.
Salmonella enterica is the leading cause of death and hospitalizations due to foodborne pathogens each year [14]. Previous studies have shown that the methylation of the G m6 ATC motif by the MTase Dam is an essential factor in the virulence of Salmonella, and that a lack of methylation leads to attenuation in animal models [15]. Subsequent studies have elucidated the mechanisms by which some virulence genes are regulated by Dam, including the plasmid-encoded fimbriae (pef) locus [16] and the std fimbrial operon [17]. In addition, Dam regulates both the phase variation of STM2209-STM2208 which alters lipopolysaccharide O-antigen side chain length [18], and the phase variation of the phage P22 glucosyltransferase (gtr) operon which controls O-antigen glucosylation [19]. Thus, it is possible that the methylation of other motifs in Salmonella also may have implications for virulence, pathogenicity, and other functions. Here, we sequenced and closed the genomes of six Salmonella enterica isolates from five serovars. We then analysed their methylomes, along with the methylomes of four additional serovars that we sequenced previously [11,[20][21][22], and employed a bioinformatics approach to identify methyltransferases and match them to observed methylated motifs in the genomes. We also examined how methylation patterns varied between Salmonella serovars.

Materials and Methods
We selected five serovars of Salmonella enterica subs enterica from our in-house strain collection at the FDA-CFSAN. These included Salmonella enterica subs enterica serovar (S. Bareilly), S. Abaetetuba, S. Abony, S. Anatum, S. Bredeney, S. Montevideo, and two isolates of S. Enteritidis. We also included data from four serovars we sequenced previously, S. Javiana, S. Typhimurium, S. Heidelberg, and S. Cubana [11,[20][21][22] (see Table 1 for strain names and accession numbers). The Methylomes of Nine Salmonella enterica Serovars Each strain was plated onto Trypticase Soy Agar and incubated overnight at 37°C. Cells were then inoculated into Trypticase Soy Broth for DNA extraction. A 1 ml-aliquot was pelleted, and genomic DNA was extracted using the DNeasy Blood and Tissue kit from Qiagen (Qiagen, CA, USA). All samples were analyzed at the exponential stage of growth.
DNA was sheared to approximately 10 kb using a Covaris g-TUBE (Covaris, Inc.; Woburn, MA). SMRTbell 10 kb template libraries were prepared using DNA Template Prep Kit 2.0 and the Low-Input 10 kb Library Protocol (Pacific Biosciences; Menlo Park, CA, USA). In brief, DNA was concentrated, repaired, ligated to hairpin adapters, and purified. Incompletely formed SMRTbell templates were digested with a combination of Exonucleases III and VII. Adapters were annealed, and SMRT sequencing was carried out on the PacBioRS II (Pacific Biosciences; Menlo Park, CA, USA) using standard protocols.
Analysis of sequence reads was implemented using SMRT Analysis 1.10 and the SMRT Portal 2.0 platform (Pacific Biosciences). De novo assembly was performed using the Hierarchical Genome Assembly Process (HGAP) with default parameters [23]. HGAP consists of three steps to ensure high accuracy. First, Basic Local Alignment with Successive Refinement (BLASR) is used to align all reads to the longest seed reads and a consensus is generated to create pre-assembled reads. Preassembled reads are then assembled using the Celera assembler. Finally, all reads are mapped to the de novo assembly and final consensus and accuracy scores are determined using the Quiver consensus algorithm. HGAP outputs assemblies with overlapping regions at the ends. Coordinates of this region were identified using dot plots in Gepard [24], and trimmed from one end to circularize the genome. Genomes were checked manually for even sequencing coverage. Genomes were annotated using the NCBI (National Center for Biotechnology Information) Prokaryotic Genomes Automatic Annotation Pipeline [25] (http://www.ncbi.nlm.nih.gov/genomes/static/Pipeline.html). Prophages were detected using PHAST [26]. Only prophages scored as intact are reported here. We excluded putative intact prophages that did not show significant sequence similarity to known phages using the Basic Local Alignment Search Tool (BLAST) sequence alignment tool with default parameters.
Motif Detection and Analysis was also carried out using SMRT Analysis 1.1 and the RS_Modification_and_Motif_Analysis.1 protocol as described at http://www.pacb.com/pdf/ TN_Detecting_DNA_Base_Modifications.pdf. Interpulse durations (IPDs) were measured based on the kinetic signals [7] and processed as described previously [6]. At each position in the genome, the observed IPD was compared to the IPD of an in-silico control using a twosample t-test, and a QV score was calculated as QV = -10 log (p-value). Bases were accepted as modified based on a minimum QV threshold value. QV 30 was used as a threshold for preliminary analyses. A plot of QV versus coverage was then constructed using publicly available R scripts found at: https://github.com/PacificBiosciences/motif-finding. The observed bimodal distribution of kinetic data, resulting from modified and unmodified positions, was then used to determine a more stringent QV threshold (S1 Fig). Only sites with a minimum of 25x coverage were included. Motifs were identified using the algorithm MotifMaker. m6A and m4C motifs can be reliably detected with 25x coverage across all positions in the genome, but m5C requires either significantly higher coverage (~100x) or Tet-methylation for confident detection. In this study we report only m6A and m4C methylations. To identify MTases, assembled genomes were scanned for homologs of RM system genes using in-house software (e value > 1e-11) to identify putative MTases as previously described [10]. Predicted specificities were assigned to candidate MTases based on specificities of the known MTases. The presence of functional motifs and information regarding the placement of the gene within the genome were also used to support or reject those assignments, as were known characteristics of different MTase types. For example, Type III MTases and most Type IIG systems only methylate one strand of their recognition sequence, whereas Type I systems have bipartite recognition sequences. MTase candidates with predicted specificities were matched where possible with observed motifs found in our motif analyses. If a single candidate MTase existed for an observed motif, then that gene was assumed to be responsible for that particular specificity. If multiple candidates existed for a single motif, no MTase was assigned. When making assignments of new motifs to specific MTases, we always cross-checked the matched gene against other similar genes in REBASE and against the unassigned motifs from the more than 700 other genomes for which we have PacBio data. In many cases, the same motif occurred in a different genome with an essentially identical methyltransferase or specificity subunit protein sequence, adding weight to the strength of the assignment. Raw processed PacBio data files were deposited in the Sequence Read Archive (SRA) database of the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/sra) (S2 Table) and MTase information and sequences were deposited in REBASE (http://rebase.neb.com/rebase/rebase.html).

Genome Assemblies
All genomes were assembled into a single, circular chromosomal contig and up to three plasmids. Consensus accuracy scores were at least 99.9999% for all assemblies. Sizes of Salmonella chromosomes ranged from 4,547,600 -4,977,480 bp, plasmid sizes ranged from 3,609-221,009 bp (Table 1). Sequences were deposited in GenBank. Putative prophages and BLAST alignment data are reported in Table 1.

Methylation Patterns
This is the first comparative report of genome-wide methylation patterns in the pathogenic bacteria Salmonella enterica. We analyzed the methylomes of five Salmonella enterica subsp. enterica serovars, including two isolates of S. Enteritidis. We also sequenced and released their closed genomes. We present those results, along with data from four additional Salmonella serovars, S. Javiana, S. Typhimurium, S. Heidelberg, and S. Cubana, which we analyzed previously [11,[20][21][22]. In total, we observed 18 motifs among the nine Salmonella serovars, 16 m6A motifs, one m4C motif, m4 CCWWGG, and one Type I MTase which encodes both m6A and m4C activities, G m6 ATGN 5 G 4m GC (Fig 1; an underscore represents the base which is methylated in the opposite DNA strand; W = A or T). Eight of the motifs were novel, i.e., they have not been previously observed in any bacterial species. We were able to match 13 of the Salmonella motifs to their respective MTase enzymes in most of the serovars tested (S1 Table).
Several motifs were common among multiple serovars, while other motifs were unique to specific serovars. All Salmonella serovars examined contained the methylated motifs ATGC m6 AT, CAG m6 AG, and G m6 ATC. In all serovars, we identified a Type III MTase responsible for the methylation of CAG m6 AG, and an extremely common Type II MTase was found to methylate the ATGC m6 AT motif (see Table 2 for a list of enzyme names specific to each strain). The methylation of ATGC m6 AT was never complete (38-78.5%). This MTase is usually active in Salmonella, although rarely active in E. coli, and is not thought to be an essential gene [27]. Confident assignment of an MTase to the G m6 ATC motifs could only be performed in eight of the eleven isolates: two were orphan MTases, and the remaining were common Type II enzymes. In multiple serovars, we identified candidate enzymes that have the potential to methylate this motif (Table 3).
Other observed motifs were common among a subsection of the serovars examined. For example S. Typhimurium and both isolates of S. Heidelberg contained the common motif G m6 AGN 6 RTAYG that is methylated by a Type I MTase. Six of the nine serovars, S. Bareilly, S. Abony, S. Cubana, S. Javiana, S. Montevideo, and S. Anatum, contained a motif not found in the other serovars tested (Fig 1). For example, in S. Anatum, we observed the motif CC m6 AN 7 TGAG. Fig 2 shows the kinetic signals of three of these motifs. In most cases these unique motifs were strongly methylated. Several novel motifs were not matched to any MTases including GG m6 AN 6 ATTA and RA m6 ACN 5 TGA in S. Cubana, and CG m6 AYN 7 RTRTC in S.

Montevideo.
Several observed motifs could not be assigned to a single MTase. In some cases, there were multiple MTases with predicted specificities that matched that of an observed motif. In these cases, it was not possible to predict which enzyme was responsible for the methylation of the observed motifs, and thus no enzyme was assigned. Furthermore, we could not rule out the possibility that multiple enzymes methylated the same motif, as has been observed with G m6 ATC [28]. MTases may also be promiscuous [29], i.e., they methylate multiple motifs, making a match to any single motif unrealistic. In some cases, there was no MTase present in the genome with a specificity predicted to recognize an observed methylated motif.
On other occasions, we did not observe the methylation of a motif that we predicted would be present based on a putative MTase identification. For example, in S. Heidelberg CFSAN002064, we detected the gene for the putative methyltransferase Sen2064ORF15615P, and predicted that it would be responsible for GATC m6 AG methylation. However, we did not observe the activity of this methyltransferase in S. Heidelberg, which means the enzyme is inactive. Inactivity can be the result of a mutation in the enzyme which renders it inactive, or, the enzyme may be functional, but not at the time of analysis. For example, some MTases may be inactive due to transcriptional silencing as is often found when the genes are present as part of a prophage [30]. Furthermore, an MTase may be transcribed, but for unknown reasons, may not routinely modify its' target motif [12]. Cloning MTase genes has shown to be a useful approach for their characterization [6], and may help to match motifs to predicted MTases in cases where bioinformatics alone was insufficient. This approach should be incorporated into future studies that target particular MTases. For example, the cloning of Sen2064ORF15615P in an expression vector would resolve whether the enzyme is inactive or not functional in S. Heidelberg at the time of analysis.
We cannot completely rule out the possibility that DNA MTase genes exist that show no similarity to characterized MTase genes. However, with methylation data from more than 700 genomes available and almost 2,500 characterized and 50,000 putative MTase genes identified in REBASE, the chances of finding a completely new way of methylating DNA are getting increasingly smaller. In particular, we rarely come across a case where we can be certain that there are insufficient MTases to account for the observed patterns of methylation. However, in Salmonella enterica subsp. enterica serovar Heidelberg CFSAN002064, the methylated motif ACC m6 ANCC occurs, which may indicate a plasmid is missing. This contrasts with CFSAN002069, which also has this motif, but does have a potential plasmid-encoded MTase.
In other cases we have observed this motif is present in strains containing plasmids (R.J. Roberts, unpublished). Furthermore, as more genome sequence data and PacBio methylation data appear, our ability to predict recognition sequences from sequence data alone is growing. Already, rules are becoming apparent for predicting the specificity of Type IIG enzymes [31].
Most of the novel motifs observed in each serovar were modified by Type I RM systems (Fig  1). Type I systems have a modular structure that may allow sequence specificities to diversify more easily than the structures of other RM types (for review, see [32]). Each system consists of two methylase (M) units, two restriction endonuclease (R) units, and one sequence specificity (S) subunit [33,34]. The S subunit has two TRDs, each of which recognizes one half of the target motif. Recombination events may occur on the S subunit, either within a single TRD or within the sequence that joins the two, resulting in novel specificity. Also, R and M subunits may interact with foreign S subunits entering the cell, also resulting in novel specificity. This has been observed in Lactococcus [35]. One interesting Type I motif, G m6 ATGN 5 G 4m GC, is exhibited by the specificity subunit of the SenAboIII system. This example of cooperation between an m6A methylase and an m4C methylase is quite rare and has only been infrequently observed previously (R. Morgan, unpublished observations). The Methylomes of Nine Salmonella enterica Serovars Unique motifs found among closely related taxa may be the result of horizontal gene transfer (HGT). Studies have demonstrated that HGT accounts for the movements of RM systems based on evidence of codon usage bias [36] and differential GC content of RM genes [37]. We identified several MTases that are located on prophages and plasmids, indicating possible mechanisms of transfer (Table 1). Also, through BLAST similarity searches against REBASE we found that several MTase sequences are most similar, or highly similar, to enzymes in Enterobacteriaceae genera other than Salmonella, suggesting that these systems may have been acquired via HGT. For example, M.SbaUII from S. Bareilly, which methylates the motif C m6 AGCTG, is most similar to an MTase found in Pectobacterium. Currently, we are building a robust Salmonella phylogeny, including representatives of other Enterobacteriaceae genera, to test these and other evolutionary hypotheses.
In some taxa, we detected a proportion of motifs that were not fully methylated within the genome. In particular, only 38-78.5% of ATGC m6 AT sites across the genome were methylated, and 89.3-100% of CAG m6 AG sites were methylated ( Table 2). Orphan MTases or RM systems with an inactive REase often do not methylate all sites in the genome, as complete methylation at all sites to protect from cleavage is usually unnecessary. Incomplete methylation may also be due to the fact that cells are analyzed at different times during the cell cycle, or methylation at certain sites may be inhibited by DNA binding proteins [38]. Environmental factors, including culture conditions, may also affect the frequency of methylation [9,39]. Incomplete methylation may play a role in the regulation of gene expression. Thus, studies examining the functional implications of ATGC m6 AT and CAG m6 AG methylation will be particularly interesting.
In several of the genomes, ATGC m6 AT methyltransferases are biased towards preferentially methylating this motif when preceded by a cytosine, a thymine, or both. For example, in S. Heidelberg CFSAN002069, AATGC m6 AT and GATGC m6 AT are methylated at lower frequencies than TATGC m6 AT and CATGC m6 AT. All four motifs are found in a roughly 1:1:1:1 ratio throughout the genome, indicating a true bias in methyltransferase activity. Currently, we are investigating the biological significance of these observations. Interestingly, 20 ATGC m6 AT motifs are present in a collection of 101 previously characterized Salmonella virulence genes [40], and ten of these are AATGC m6 ATs, a much higher proportion than what is expected by chance.

Conclusions
In total, we observed 18 motifs among the nine Salmonella serovars, eight of which are novel. These findings indicate the diversity of motifs present in Salmonella enterica. The functions of the observed motifs are unknown, except for G m6 ATC, which has been well studied and is involved in a variety of biological processes including virulence [15]. In E. coli, methylation of CTGC m6 AG by the MTase M.EcoGIII, is shown to affect the transcription of over 30% of genes [12]. It is possible that the methylation of motifs in Salmonella described here may also play a role in virulence and other cell functions, and thus merit further study. Future studies should also continue to explore how methylation patterns vary across serovars, and examine within-serovar variation. Methylation may be useful as a typing marker, as closely related taxa are often difficult to differentiate using morphological and molecular markers. The reconstruction of a Salmonella phylogeny, along with the analysis of the methylomes will allow us to address these issues and gain a more broad view of the evolutionary history and functional significance of methylation within the genus.