Genomic Comparison of Escherichia coli O104:H4 Isolates from 2009 and 2011 Reveals Plasmid, and Prophage Heterogeneity, Including Shiga Toxin Encoding Phage stx2

In May of 2011, an enteroaggregative Escherichia coli O104:H4 strain that had acquired a Shiga toxin 2-converting phage caused a large outbreak of bloody diarrhea in Europe which was notable for its high prevalence of hemolytic uremic syndrome cases. Several studies have described the genomic inventory and phylogenies of strains associated with the outbreak and a collection of historical E. coli O104:H4 isolates using draft genome assemblies. We present the complete, closed genome sequences of an isolate from the 2011 outbreak (2011C–3493) and two isolates from cases of bloody diarrhea that occurred in the Republic of Georgia in 2009 (2009EL–2050 and 2009EL–2071). Comparative genome analysis indicates that, while the Georgian strains are the nearest neighbors to the 2011 outbreak isolates sequenced to date, structural and nucleotide-level differences are evident in the Stx2 phage genomes, the mer/tet antibiotic resistance island, and in the prophage and plasmid profiles of the strains, including a previously undescribed plasmid with homology to the pMT virulence plasmid of Yersinia pestis. In addition, multiphenotype analysis showed that 2009EL–2071 possessed higher resistance to polymyxin and membrane-disrupting agents. Finally, we show evidence by electron microscopy of the presence of a common phage morphotype among the European and Georgian strains and a second phage morphotype among the Georgian strains. The presence of at least two stx2 phage genotypes in host genetic backgrounds that may derive from a recent common ancestor of the 2011 outbreak isolates indicates that the emergence of stx2 phage-containing E. coli O104:H4 strains probably occurred more than once, or that the current outbreak isolates may be the result of a recent transfer of a new stx2 phage element into a pre-existing stx2-positive genetic background.


Introduction
Pathogenic Escherichia coli strains are capable of causing a number of disease states in humans and animals and colonizing a variety of niches within these hosts [1]. The ability of certain pathotypes of E. coli to colonize agriculturally important domestic animals and survive in meat products makes these organisms a particularly common cause of foodborne infections [2,3]. In addition, some E. coli strains have been shown to colonize plant tissues following contamination of soils or irrigation water from infected herds or wildlife, resulting in large outbreaks that have been attributed to sprouts or contaminated vegetables [4,5,6,7,8].
In the case of enterohemorrhagic E. coli (EHEC) strains that produce Shiga toxins (Stx), infection of a susceptible host results in fever and bloody diarrhea, and can progress in some cases to hemolytic uremic syndrome (HUS) and other severe complications, which can be fatal [9,10]. Because of their relatively high pathogenicity and ease of transmission, pathogenic E. coli strains have been classified as potential agents of bioterrorism [11].
Human pathogenic E. coli strains exhibit a wide spectrum of phenotypes and clinical manifestations and can colonize a broad range of tissues and body sites. The tissue tropism of any given strain is largely dependent on the genetic armamentarium that it possesses. Strains can vary dramatically in their genetic complement, which draws on a variety of exchangeable elements including plasmids, transposons, pathogenicity islands, and other mobile elements, most notably cryptic, active, and lysogenic bacteriophage (reviewed in ref. [1]). These elements can, separately or together, carry genes encoding antibiotic resistance; bacterial toxins; extracellular structures promoting adhesion (fimbriae, pili); and the extracellular polysaccharide and flagellar subunits that designate their serotypes (e.g. O157:H7) [12,13]. New combinations of these chromosomal and extrachromosomal elements continually emerge and propagate in the environment and in susceptible hosts, leading to host shifts and new clinical presentations.
During the last two decades, most reported incidents of HUS have been attributed to E. coli strains belonging to serotype O157:H7 [14]. However, in the past few years diagnostic tests targeting Shiga toxins have allowed detection of Shiga toxin-producing E. coli (STEC) strains belonging to different serotypes from cases of hemorrhagic colitis and HUS [15]. Genes encoding Stx and Stx variants are located on transmissible prophages that are carried in the chromosomes of each strain. Stx-encoding prophage can excise and begin replicating when the bacteria are subjected to DNA-damaging growth conditions including the presence of antibiotics [16,17,18], and phage particles arising from such events can result in horizontal transmission of the Shiga toxin genes through the lysogeny of unrelated E. coli strains [19]. The Stx toxins themselves are thought to mediate the most severe consequences of STEC infection by causing toxicity and inflammation of the kidneys [9,20].
In contrast to classical EHEC strains, which colonize the intestine by means of an elaborate Type III secretion system encoded on a pathogenicity island [21] that facilitates the actin-mediated formation of pedestals on host cell surfaces to which the colonizing bacteria adhere [22], enteroaggregative strains (EAggEC) exhibit dramatically different strategies for colonization and infection. During colonization of the colon and ileum, these strains express enteroaggregative fimbriae [23,24] and form dense biofilm-like aggregates that adhere tightly to the epithelial layer. These aggregates are rendered flexible by the expression of dispersin, which also aids penetrance of mucous layers. EAggEC strains express a repertoire of lineage-specific virulence factors that includes SPATE proteins (serine protease autotransporter proteins of Enterobacteriaceae [25]), among them the mucinolytic Pic protein [26], and more strain-specific toxins including Pet [27]. Like other pathogenic E. coli strains, EAggEC strains are susceptible to infection by lambdoid phages including stx phages: a report of a 2001 case of HUS caused by a stx2-positive EAggEC belonging to the O104:H4 serotype [28,29] was followed in 2009 by several cases of HUS and bloody diarrhea in the Republic of Georgia, which were eventually attributed to stx2-positive E. coli O104:H4 strains (ref. [30] and Chokoshvili O. et al., manuscript in preparation). Simultaneously with the work presented in this study, Beutin et al. undertook an independent characterization limited to the stx2 prophages present in those 2009 strains [31].
Strains of the O104:H4 serotype harboring the Stx2-encoding phage received little notice until a stx2-positive EAggEC (StxEAggEC) strain caused a large outbreak centered in Germany [30,32] and a small outbreak in France [33]. In all, 16 countries in Europe and North America reported a total of 4075 cases and 50 deaths. Hemolytic-uremic syndrome (HUS) was a frequent complication of the illness in these outbreaks, occurring in 22% of the reported illnesses [34]. The source of the O104:H4 infection in Germany and France was traced to sprouts derived from fenugreek seeds that had been imported into Europe from Egypt in 2009 [9,10,35]. Early results indicated that the 2011 outbreak strain of E. coli O104:H4, while expressing the Shiga toxin typical of enterohemorrhagic (EHEC) strains, shared features of enteroaggregative (EAggEC or EAEC) strains [32,36]. While the strain was characterized rapidly by a series of efforts spread across the globe [29,36,37,38], no closed, finished sequence of the outbreak strain was presented. Furthermore, the origins and evolutionary history of the 2011 outbreak strain remain obscure.
We obtained isolates of E. coli O104:H4 strains from a previous case cluster in the Republic of Georgia in 2009 and compared them to a representative isolate of the 2011 outbreak strain. The isolate from the 2011 outbreak was obtained from a patient hospitalized in the United States who had travelled to the outbreak zone in Germany in May. To characterize the strains, we performed both multi-phenotypic analysis and whole-genome sequencing of each isolate using both classical and high-throughput molecular approaches. We present the first fully closed, finished genome sequences of each isolate and compare the genetic content of each strain including but not limited to the prophage regions. The 2011 outbreak strain was distinguishable from the 2009 strains by the presence of a plasmid encoding a CTX-M-15 beta-lactamase and by differences in the prophage content, chromosomal and plasmid sequences, and the profiles of mobile genetic element insertions. Despite a common and very closely related core genome, each strain carried a distinct repertoire of unique genomic and plasmid regions, with particular divergence in the bacteriophage loci encoded in each genome. Furthermore, we show induction of at least two distinct bacteriophages from two of the three strains. The phenotypic differences and molecular diversity of Shiga toxin-positive EAggEC suggest that significant uncharacterized diversity exists within this clade of E. coli strains that may pose additional outbreak risks.
in May 2011, and strains 2009EL-2050 and 2009EL-2071 were each isolated from different patients in the Republic of Georgia (Table 1). Because the bacterial isolates used in this study are publicly available and non-identifiable, the work conducted with these isolates does not involve human subjects, as defined in the existing U.S. Federal regulations for human subject research (see 45 CFR 46.102(f)). Informed consent was not obtained since these isolates were collected in the course of routine patient care; secondary use of such non-identifiable isolates does not require informed consent per human subjects protection regulations.
The strains from this study were compared to the following E. coli O104:H4 strains previously described in the literature: strain TY2482 (stx2+ EAggEC from a case of bloody diarrhea in a 16-year-old girl from Germany, 2011) [38], 55989 (stx-negative EAggEC from a case of persistent watery diarrhea in an HIV-infected adult from the Central African Republic, 2001) [38,39] and HUSEC041 (01-09591) (stx2+ EAggEC from a case of HUS in Germany in 2001) [28,29].

Plasmid DNA Isolation
To visualize large plasmids present in the E. coli strains characterized here, the following protocol was used to isolate intact plasmids. Bacterial cultures were streaked out on Luria-Bertani (LB) agar plates from −70°C stocks and plates were incubated at 37°C for 18 hours for colony formation. Overnight cultures were prepared from isolated colonies grown on agar plates. Briefly, 20 ml of LB broth was inoculated with 3 to 4 isolated bacterial colonies and incubated at 37°C for 20 hours with shaking at 225 rpm. Three ml of each bacterial culture were centrifuged at 12,000 × g for 2 minutes at room temperature, the supernatants were removed and the cell pellets were thoroughly suspended in 250 µl of an ice cold solution of 50 mM glucose, 10 mM EDTA, and 25 mM Tris-HCl, pH 8.0. All sample preparations were handled gently during and after lysis to prevent shearing of the supercoiled DNA. The tubes were incubated at room temperature for 5 minutes prior to addition of 250 µl of a solution of 0.2 N NaOH and 1% SDS to lyse the cells followed by gentle mixing by inverting the tube six times and holding at room temperature for 5 minutes. After lysis, 3 M potassium acetate solution pH 4.8 (250 µl) was added to a concentration of 1 M, and the lysates were mixed by gently inverting the tubes 10 times followed by centrifugation at 12,000 × g for 5 minutes at room temperature. The supernatants were transferred to fresh 1.5 ml tubes, centrifuged another 5 minutes to avoid carryover of any precipitated material and the supernatants were transferred to new tubes. The cleared supernatants were treated with RNase A to a final concentration of 50 µg/ml and incubated at 37°C for 30 minutes. After RNase treatment, supernatants were extracted twice with phenol:chloroform and 3 times with chloroform:isoamyl alcohol.
The nucleic acids from supernatants were precipitated by addition of 0.1 volume of 3 M sodium acetate (pH 5.0) and 100% ethanol followed by incubation at −20°C for an hour and centrifugation at 15,000 × g for 30 minutes at 4°C. The pellets were rinsed with 1 ml of ice cold 70% ethanol and centrifuged as described above. The supernatants were discarded and the pellets were dried under vacuum without applying heat. The pellets were hydrated with 100 µl of distilled water and stored at 4°C overnight to rehydrate the DNA. DNA samples were concentrated about 5-fold (~20 µl) under vacuum before loading onto an agarose gel for pulsed field gel electrophoretic analysis.
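The concentrations quoted in the lysis and concentration steps above can be checked with simple dilution arithmetic. The following is an illustrative calculation only, not part of the published protocol. It assumes that the resuspension buffer, the NaOH/SDS solution, and the potassium acetate are combined in equal volumes, as the text indicates; the function name final_concentration is hypothetical.

```python
def final_concentration(stock_conc, stock_volume, total_volume):
    """Solve C1 * V1 = C2 * V2 for C2 (any consistent volume unit)."""
    if total_volume <= 0:
        raise ValueError("total_volume must be positive")
    return stock_conc * stock_volume / total_volume


# Each of the three additions has the same volume (250 units in the protocol).
part = 250.0

# 0.2 N NaOH after mixing with one equal volume of resuspension buffer.
naoh_during_lysis = final_concentration(0.2, part, 2 * part)

# 3 M potassium acetate after it joins the two earlier additions.
acetate_after_neutralization = final_concentration(3.0, part, 3 * part)

# Concentrating a 100-unit sample down to about 20 units.
fold_concentration = 100.0 / 20.0

print(naoh_during_lysis, acetate_after_neutralization, fold_concentration)
```

Under these assumptions the potassium acetate ends up at 1 M and the vacuum step gives a 5-fold concentration, in agreement with the values stated in the protocol.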

Pulse Field Gel Electrophoresis of Plasmid DNA
For each sample 20 µl of eluted plasmid DNA from 3 ml culture volume was analyzed by pulse field gel electrophoresis (PFGE). Briefly, 20 µl eluted sample was mixed with 4 µl of 6X loading dye (US Biologicals, Swampscott, MA) and loaded onto a 1% Pulse Field Certified Agarose (BioRad Laboratories, Hercules, CA) gel. Ten µl of high molecular range DNA ladder (5 kb DNA size standard, BioRad Laboratories, Hercules, CA) was loaded onto a lane as a size standard in the gel. The gel was electrophoresed in 0.5× TBE buffer, recirculated at 14°C. The run time was 18 hours at 6 V/cm with a 1 to 6 second switch time ramp. After completion of run the gel was stained with 500 ml of ethidium bromide (EtBr) solution (10 mg EtBr/ml of distilled water) for 1 hour at room temperature. The gel was de-stained with 1 liter of distilled water for 1 hour at room temperature prior to visualization in a UV transilluminator (BioRad) and photographed.

Whole-genome Sequencing
Genomic DNA was prepared using the Ultraclean Microbial DNA isolation kit (MoBio, Carlsbad CA). The draft genome sequences of all three isolates were generated by a consortium consisting of ECBC, NMRC and the LANL Genome Science Group using a combination of Illumina [40] and 454 technologies [41]. For each of these genomes we constructed and sequenced Illumina and 454 Titanium standard shotgun libraries, and a paired end 454 library (Tables S1 and S2 in File S2). All general processes and protocols of library construction and sequencing can be found at http://www.jgi.doe.gov/. The 454 Titanium standard data and the 454 paired end data were assembled together with Newbler, version 2.3-PreRelease-6/30/2009. The Newbler consensus sequences were computationally shredded into 2 kb overlapping fake reads (shreds). Illumina sequencing data were assembled with VELVET, version 1.0.13 [42], and the consensus sequences were computationally shredded into 1.5 kb overlapping fake reads (shreds). The 454 Newbler consensus shreds, the Illumina VELVET consensus shreds and the read pairs in the 454 paired end library were integrated using parallel phrap, version 1.080812 (High Performance Software, LLC) [43,44]. Illumina data were used to correct potential base errors and increase consensus quality using the software Polisher developed at JGI (Alla Lapidus, unpublished). Possible mis-assemblies were corrected using gapResolution (Cliff Han, unpublished), or Dupfinisher [45], and further edited manually with Consed [46]. The initial high quality draft assemblies contained 68-77 contigs and 6-9 scaffolds. Gaps between the contigs were closed by editing in Consed, by PCR, and by primer walks. Each genome required 400-800 additional finishing reactions to close gaps, resolve repetitive elements, and correct low-quality sequence regions. The final assemblies are based on 84.
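The shredding of consensus sequences into overlapping pseudo-reads can be illustrated with a minimal sketch. This is not the consortium's actual tool: the function name shred and the overlap of 1 kb are assumptions chosen for illustration, since the text states only the shred lengths (2 kb and 1.5 kb) and that the shreds overlap.

```python
def shred(consensus, shred_len=2000, overlap=1000):
    """Cut a consensus sequence into overlapping pseudo-reads ("shreds").

    shred_len follows the 2 kb value given in the text; overlap is a
    hypothetical parameter, because the actual overlap is not stated.
    """
    if not 0 <= overlap < shred_len:
        raise ValueError("overlap must satisfy 0 <= overlap < shred_len")
    step = shred_len - overlap
    shreds = []
    start = 0
    while True:
        # The final shred may be shorter than shred_len.
        shreds.append(consensus[start:start + shred_len])
        if start + shred_len >= len(consensus):
            break
        start += step
    return shreds


if __name__ == "__main__":
    toy_contig = "ACGT" * 1125  # 4500 bp toy consensus
    pieces = shred(toy_contig)
    print(len(pieces), [len(p) for p in pieces])
```

Because consecutive shreds share `overlap` bases, an assembler such as phrap can rejoin them while also merging in reads from the other libraries.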

Optical Mapping
Confirmatory optical maps were generated according to manufacturer's procedures (OpGen, Gaithersburg MD) using NcoI and/or BamHI. Optical maps of the strains were compared to each other and to the previously published TY-2482 sequence using MapSolver (OpGen).

Identification of SNPs and Small-scale Genetic Variations
The finished sequences were compared to each other using the nucmer algorithm in MUMMER [48,49]. SNPs and Indels were reported directly from genome alignments while the unaligned portions of the respective genomes were tracked by coordinates and captured as gaps. Candidate variations in the nucleotide sequences in the finished sequences of each strain were confirmed by mapping the raw 454 and Illumina reads back onto the finished sequence using the GSMapper package in Newbler and/or the read-mapping tool in Genomics Workbench from CLC Bio. Variations evident in both the finished and mapped data and free of potential assembly conflicts (i.e. where >85% of the raw reads differed from the reference) were considered to be confirmed. Instances where the raw sequencing data from each strain conflicted with the finished sequence from the parent strain were considered to be errors in the final assembly.
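The read-support criterion for confirming a candidate variation can be expressed as a short sketch. The record type CandidateVariant and the function is_confirmed are hypothetical names introduced for illustration; only the 85% threshold comes from the text, and the actual confirmation was done with GSMapper and CLC Genomics Workbench rather than with code like this.

```python
from dataclasses import dataclass


@dataclass
class CandidateVariant:
    position: int        # coordinate in the reference genome
    variant_reads: int   # mapped raw reads that differ from the reference
    total_reads: int     # all mapped raw reads covering the position


def is_confirmed(variant, min_fraction=0.85):
    """Return True when more than min_fraction of reads support the variant."""
    if variant.total_reads == 0:
        return False  # no coverage, nothing to confirm
    return variant.variant_reads / variant.total_reads > min_fraction


if __name__ == "__main__":
    # Toy values, not data from the study.
    candidates = [
        CandidateVariant(position=1200, variant_reads=58, total_reads=60),
        CandidateVariant(position=8800, variant_reads=30, total_reads=60),
    ]
    for c in candidates:
        print(c.position, is_confirmed(c))
```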

Phylogenetic Analysis
The MCL clustering algorithm [50] was used to identify conserved core protein ortholog families among the German outbreak isolate 2011C-3493, strain TY-2482, the two Georgian isolates 2009EL-2050 and 2009EL-2071, and all complete E. coli genomes available in NCBI (Table S3 in File S2). Escherichia fergusonii strain ATCC 35469T was used as an outgroup. Ortholog families with genes present in all E. coli as well as the German and Georgian isolates were used to create the core phylogeny. Protein sequences in each family were aligned separately using MAFFT [51], then concatenated by species into a mega-alignment using an in-house perl script, with uninformative columns removed. A phylogenetic tree was constructed using RAxML [52] with the JTT+GAMMA+I model. This same approach was applied to the O104:H4 clade to highlight relationships within this group of closely related organisms using FastTreeMP and the general time-reversible model [53].
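The concatenation step can be sketched as follows. The in-house perl script is not available, so this Python version is only an approximation of the idea: the function names are hypothetical, and "uninformative" is assumed here to mean columns in which every genome carries the same character, which may differ from the criterion the authors applied.

```python
def concatenate_families(families):
    """Join per-family alignments into one sequence per genome.

    families: list of dicts mapping genome name -> aligned protein sequence.
    Every family must contain the same genomes, aligned to equal length.
    """
    if not families:
        return {}
    genomes = sorted(families[0])
    concatenated = {g: "" for g in genomes}
    for fam in families:
        if sorted(fam) != genomes:
            raise ValueError("family does not cover the same genomes")
        if len({len(seq) for seq in fam.values()}) != 1:
            raise ValueError("sequences in a family are not aligned")
        for g in genomes:
            concatenated[g] += fam[g]
    return concatenated


def drop_invariant_columns(alignment):
    """Remove columns where all genomes share one character (assumed
    definition of an uninformative column)."""
    genomes = list(alignment)
    columns = zip(*(alignment[g] for g in genomes))
    kept = [col for col in columns if len(set(col)) > 1]
    return {g: "".join(col[i] for col in kept) for i, g in enumerate(genomes)}


if __name__ == "__main__":
    # Toy alignments, not data from the study.
    toy = [
        {"A": "MKT", "B": "MRT", "C": "MKT"},
        {"A": "GA-", "B": "GAL", "C": "GAL"},
    ]
    print(drop_invariant_columns(concatenate_families(toy)))
```

The reduced alignment would then be passed to a tree-building program such as RAxML or FastTree.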

Genomic Comparison
The genome comparisons at the nucleotide level were carried out with genome alignment tools, such as MUMmer2 [48], NUCmer [49], and the Artemis Comparison Tool (ACT) [54] (http://www.sanger.ac.uk/Software/ACT/). The comparison of genomic island insertion/deletion patterns was performed using the ACT alignment program at the default settings. Predicted genomic island insertion sites were identified from sequence alignments and breakpoint sites were further manually curated. The gene name and locus ID were assigned based on the NCBI Reference Sequence file.

Identification of Prophage Regions in Completed Genome Sequences
The finished genomic sequences of the three strains focused on in this study along with the published sequences of 55989 (NC_011748.1) and TY2482 (PRJNA67657) were annotated using the DIYA software [55]. The GenBank outputs of DIYA software were used as inputs for Phage_Finder analysis software [56] to identify potential prophage regions in the genomic sequences. The Phage_Finder outputs were manually curated to identify additional phage proteins outside of the regions identified by Phage_Finder and to validate the identity of the phage related genes encoded by the prophages [57]. Prophage similarity analysis was conducted using BLASTN version 2.2.18 [58]. Phage similarity was defined as any two phages having 95% or greater identity along 95% or more of genome length and position in the same general location on the E. coli chromosome. The manually curated Phage_Finder output sequences were further validated by BLASTn analysis of the whole-genome sequences of all five strains to establish uniqueness of each prophage and also to account for potential prophage sequences that might have been missed by Phage_Finder. To generate this image all the sequences were rearranged to start at the same nucleotide position. Accordingly, the genome position 1 was set at the first C nucleotide in the following sequence: CATTATCGACTTTTGTTCGAGTGGAGTCC. Extracted phage sequences were aligned to each other using the multiple alignment tool in CLC Bio.
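Rearranging circular genome sequences so that they all start at the same nucleotide can be sketched as below. This is an illustrative approach, not the authors' procedure: the function name rotate_to_anchor is hypothetical, only the forward strand is searched, and the anchor is written without the hyphen that appears in some renderings of the text, on the assumption that the hyphen is a line-break artifact.

```python
# Anchor motif from the text; the hyphen-free form is an assumption.
ANCHOR = "CATTATCGACTTTTGTTCGAGTGGAGTCC"


def rotate_to_anchor(sequence, anchor=ANCHOR):
    """Rotate a circular sequence so that it begins at the anchor motif.

    Only the forward strand is searched; a reverse-complemented genome
    would need to be flipped first.
    """
    seq = sequence.upper()
    motif = anchor.upper()
    if not motif:
        raise ValueError("anchor must not be empty")
    # Append the start of the sequence so that a motif spanning the
    # arbitrary linearization point of the circle is still found.
    doubled = seq + seq[:len(motif) - 1]
    idx = doubled.find(motif)
    if idx == -1:
        raise ValueError("anchor not found on the forward strand")
    return seq[idx:] + seq[:idx]


if __name__ == "__main__":
    toy_genome = "TTT" + ANCHOR + "AAA"  # toy circle, not real data
    print(rotate_to_anchor(toy_genome))
```

After rotation, position 1 of every genome is the first C of the motif, so coordinates of prophage regions can be compared directly across strains.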

Phenotypic Analysis
Each strain was inoculated into twenty 96-well OmniLog phenotypic microarray plates and grown at 37°C for 36 hours. Reduction of tetrazolium dye by respiring cells was measured every 15 minutes by optical density. A heatmap of the data was produced using PheMaDB [59]. Briefly, the area-under-the-curve (AUC) values from the three different biological replicates for each unique phenotype were averaged. The ratio for each AUC was calculated between the query strain and reference parent strain. For the purpose of visualization, 1920 phenotypes were included in the heat map (i.e. this better represents the locations of the phenotypes which correspond to different modes of action categories). The same ratios were used for the phenotypes that have replicates. The ratio values were formatted as PM1 to PM20 for each strain across the rows and wells A_i to H_i, where i = 1 to 12, for the columns (note that there were no values for wells H12). The results were plotted in a heat map using R [60]. Wells in which the query strain outgrew the reference strain are represented by green blocks while wells in which the reference strain outgrew the query strain are represented by red blocks.
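The AUC averaging and ratio calculation can be sketched in a few lines. PheMaDB performs these steps internally, so the code below is only an approximation under stated assumptions: the trapezoidal rule is assumed for the AUC, and the function names auc and auc_ratio are hypothetical. The 15-minute sampling interval, the three replicates, and the 20 × 96 plate layout come from the text.

```python
from statistics import mean

PLATES = 20            # PM1 to PM20
WELLS_PER_PLATE = 96   # wells A1 to H12


def auc(readings, interval_min=15.0):
    """Trapezoidal area under a curve sampled at a fixed interval
    (assumed integration rule)."""
    return sum((a + b) / 2.0 * interval_min
               for a, b in zip(readings, readings[1:]))


def auc_ratio(query_replicates, reference_replicates):
    """Ratio of mean AUC (query strain) to mean AUC (reference strain)."""
    query_mean = mean(auc(r) for r in query_replicates)
    reference_mean = mean(auc(r) for r in reference_replicates)
    if reference_mean == 0:
        raise ZeroDivisionError("reference strain showed no signal")
    return query_mean / reference_mean


if __name__ == "__main__":
    # Toy curves (three replicates each), not data from the study.
    query = [[0, 10, 20], [0, 10, 20], [0, 10, 20]]
    reference = [[0, 5, 10], [0, 5, 10], [0, 5, 10]]
    print(auc_ratio(query, reference), PLATES * WELLS_PER_PLATE)
```

A ratio above 1 corresponds to a well in which the query strain outgrew the reference strain (green in the heat map), and a ratio below 1 to the reverse (red).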

Phage Induction
Bacterial strains were streaked for single colonies on tryptic soy agar plates from −70°C stocks and were incubated at 37°C for 18 hours. After initial incubation, single large isolated colonies were inoculated into the following media: 20 ml tryptic soy broth containing 25 mg/ml of mitomycin C, 4 mg/ml ciprofloxacin, or no antibiotics. Liquid cultures were grown at 37°C on a shaker for 18 hours. After overnight growth, 2 ml aliquots from each culture were taken and centrifuged at 10,000 × g for 2 minutes to pellet the bacteria. Supernatants were then collected and transferred to 0.22 µm Spin-X centrifuge filter tubes and centrifuged at 10,000 × g for another 2 minutes to filter sterilize. When not in use, supernatants were stored at 4°C. To estimate the number of viable phage particles produced spontaneously or under inducing conditions, supernatants were titered on the susceptible naïve indicator E. coli strain DH5α. For titration, E. coli DH5α was inoculated into 10 ml tryptic soy broth and grown at 37°C to an optical density (OD600) of ~0.5. 100 µl of the DH5α culture was infected with 10 µl of 10-fold serial dilutions of the supernatant from uninduced and induced cultures and incubated at 37°C for 20 minutes for phage adsorption. 2.5 ml of molten top agar kept at 48°C was added to the bacterial-phage mix and immediately poured onto tryptic soy agar plates pre-warmed to 37°C. The top agar was allowed to solidify for 5 minutes at room temperature before incubating for 18 hours at 37°C. Plates were examined after overnight incubation and plaques were enumerated.
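Plaque counts from a dilution series are conventionally converted to a titer in plaque-forming units (PFU) per ml. The text does not give the formula used, so the sketch below applies the standard relationship as an assumption; the function name titer_pfu_per_ml is hypothetical, and the example takes the plated volume of diluted supernatant to be 0.01 ml.

```python
def titer_pfu_per_ml(plaques, dilution, plated_volume_ml):
    """Standard plaque-assay titer: plaques / (dilution x volume plated).

    dilution is the fraction of the original supernatant in the plated
    sample, e.g. 1e-3 for the third 10-fold dilution.
    """
    if dilution <= 0 or plated_volume_ml <= 0:
        raise ValueError("dilution and plated volume must be positive")
    return plaques / (dilution * plated_volume_ml)


if __name__ == "__main__":
    # Toy count, not data from the study.
    print(titer_pfu_per_ml(plaques=42, dilution=1e-3, plated_volume_ml=0.01))
```

Comparing titers from uninduced cultures with those from mitomycin C or ciprofloxacin cultures then gives an estimate of how strongly each condition induces phage release.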

Traditional Tests and PFGE
Selected results of a panel of biochemical and phenotypic tests are given in Table 2, with results from the complete panel of tests in File S1. Most phenotypes of the three strains were similar.

Antibiotic Resistance
Antibiotic resistance profiles were determined for all three strains (Table 2). The 2011 outbreak isolate was shown to be resistant to several antibiotics in the cephalosporin class. As in previous reports, this resistance correlated with the presence of a CTX-M-15 cephalosporinase encoded on the IncI1 plasmid that was present only in the 2011 outbreak strain. In addition, two of the strains were resistant to tetracycline, consistent with the presence of a mer/tet cluster in both strains. Multiphenotypic analysis using Omnilog (see below) confirmed this result, but also indicated that the 2011 strain is more sensitive to a number of membrane-disrupting agents including polymyxin B, chlorhexidine and paromomycin. (Table S4 in File S2). The PFGE results are fully supported by whole genome sequence data described below.

Whole Genome Sequencing
Summary genome sequencing statistics for each replicon and additional statistics describing the Illumina and 454 paired-end datasets for each strain are presented in Table S4 in File S2. We compared our finished sequence of the 2011 strain to the previously released TY2482 genome sequence and, where applicable, to other published sequences.

Plasmids
Plasmid sequences were compared using BLASTn to the nucleotide database at NCBI to identify their closest matches, which are given in Table 3. The plasmids and chromosome of strain 2011C-3493 match closely to the previously published sequence of the TY2482 strain [38]. All three strains contained the small 1549 bp plasmid previously reported as pG2011 TY2482 [38] and referred to here as pG [61]. In addition, all three strains contained a ~72 kb plasmid homologous to pAA TY2482 [6], which we refer to as pAA [61]. The pAA variant in 2009EL-2071 (pAA-09EL71) contained an additional insertion caused by the expansion of a 944 bp tandem repeat that is also present on the chromosome of a transposon/retron element (Figure 1B). The plasmid originally referred to as pESBL TY2482 [38] and referred to here as pESBL-EA11 was found only in the 2011 outbreak isolate 2011C-3493 and encodes the CTX-M-15 cephalosporinase. Strain 2009EL-2050 contained an additional IncF plasmid (p09EL50) that exhibited similarity to plasmids pLF82 and pHCM2 from the Adherent/Invasive E. coli strain LF82 [62] and Salmonella enterica subsp. enterica serovar Typhi, respectively [63] (Figure 1C). The plasmid component of the Georgian strains was also distinct from that of the recently sequenced HUSEC041 strain, another O104:H4 StxEAggEC strain from a 2001 case in Germany [64]. Interestingly, the 1.5 kb pG TY2482 plasmid varied dramatically in copy number when coverage of each replicon in the datasets was compared (Table S4 in File S2). The read coverage of this replicon was not consistent with the quantity of plasmid recovered from this isolate (Figure 1A), suggesting that differences in growth conditions may affect the apparent plasmid copy number.

Chromosome
To probe the relatedness of these isolates at the chromosomal level, we examined the chromosomal architecture of the three strains. To verify the accuracy of our assemblies, the finished genomes were compared to in-house generated optical maps, which matched the finished sequences for all three isolates (File S3). The chromosomes are very similar in overall architecture (Figure 2A), with no gross rearrangements detected when the chromosomes were aligned using MAUVE. A comparison between the chromosomes of TY2482 and 2011C-3493 showed only 29 SNPs (14 synonymous, 9 non-synonymous, and 6 intergenic) and 7 gaps (totaling 859 bp), with an additional 16 SNPs (all in intergenic regions) and 3 gaps (totaling 104 bp) identified between their plasmids pESBL TY2482 and pESBL-EA11 (Figure 2B, File S4). Most of the SNPs between TY2482 and 2011C-3493 were clustered in a putative prophage region, which may indicate a misassembly in the TY2482 genome sequence or rapid divergence of this region. All but two of the remaining SNPs (at positions 43333 and 1568661, TY2482 coordinates) were identified previously as sequencing errors in the TY2482 reference [65]. Notably, 2011C-3493 shared the sequence with TY2482 at position 2252380 in the L-asparaginase 2 gene that delineated the German strains from other European isolates [65].
All strains were isolated from human stool and were positive by PCR for stx2a, aatA and aggR. The genes for stx1, eae, and ehxA were not detected by PCR.
All gaps, small indels, and SNPs (including information on intergenic, synonymous or non-synonymous SNPs) found in all two-way comparisons between the finished German and Georgian genomes are enumerated in Table 4. The intersection of these differences shows lineage-specific differences between the strains. In addition, 3 gaps (12 kbp total) were identified between the Georgian strains. These gaps represent genomic islands and deletions (see below) that differentiate the strains, and confirm results obtained using optical maps (ref. [66] and File S3). Many of the SNPs between the two Georgian strains and between each of the Georgian strains and 2011C-3493 cluster around putative prophage elements (see below), and are indicative of the divergent temperate phage residing in the chromosomes of these strains. However, other SNPs occur in core regions of the genome, providing evidence of divergence of the strains from a recent common ancestor.

Disrupted Genes
The 2011 and 2009 strains each contained a unique set of insertion elements at different positions on the chromosome (Figure 2D). The 2011 strain had a unique profile of IS element insertions, while both Georgian strains appeared more similar to each other. A number of unique insertions resulted in disruption of genes due to mobilization of elements such as transposons (Figure 2D; Table 5). Most of these insertions occurred in genes of unknown function, but several of the interrupted genes have homology to enzymes or transcriptional regulators (Table 5). Finally, several IS elements present in the plasmids of the Georgian strains are missing from the plasmid component of the 2011 outbreak isolate. Using IS element proliferation as a surrogate for genomic decay, this suggests that the plasmid present in the 2011 outbreak strain may have suffered less genomic decay than those of the Georgian strains. In addition to gene loss incurred by IS element transposition, each strain contains a unique repertoire of pseudogenes arising by nonsense or frameshift mutations (File S4). As with the IS elements, the profiles of these disrupted genes in the Georgian strains are much more similar to each other than they are to the 2011 outbreak strains. 2009EL-2071 in particular contains a gadE gene that is truncated by the insertion of an IS element. The transcriptional activator GadE regulates the transcription of genes involved in the maintenance of pH homeostasis during acid stress by controlling the decarboxylation of glutamate [67,68].

Phylogenetic Analysis
The pan-genome of all complete genomes of E. coli available in GenBank along with the German 2011 (2011C-3493) and coli. The protein sequences in each family were aligned separately and then concatenated by species. A phylogenetic tree was inferred from this core and, as expected, the two Georgian isolates clustered closely with the 2011 outbreak strains (Figure 3). To allow better differentiation of the Georgian from the 2011 outbreak isolates, we used a similar approach to find all conserved orthologs from the O104:H4 clade (2323 protein families) and used the concatenated O104:H4 core sequences to obtain a phylogenetic tree. The Georgian strains formed a cluster distinct from the 2011 outbreak isolates (Figure 3, inset), while 2011C-

Genomic Islands
Several genomic islands and large genetic elements were found that differentiated the Georgian from the 2011 outbreak strains (Figure 2A, arrowheads). In addition to the variation in plasmid content, several large regions of divergence (RDs) were observed between the strains (Figure 4A). These included prophages and large insertions/deletions mediated by recombination events and/or mobile genetic elements. These are described in more detail below. Several additional islands were noted that were common to all of the strains examined in this study but absent from the draft sequence of the 2001 HUSEC041 isolate [29].
• Mercuric ion/tetracycline resistance island (RD1) - Both the 2011 isolate (2011C-3493) and one of the 2009 Georgian isolates (2009EL-2050) contained an intact mercuric ion resistance locus. The 2011 strain contains an additional insertion of an IS element that separates the tet and mer operons. In the second Georgian isolate, this region appears to have been deleted by recombination between terminal repeat regions (Figure 4B). In general, the gene content of the RD1 locus corresponds to the detection of those genes in the Georgian strains by microarray and PCR previously reported by Jackson and co-workers [66].
• Ula operon (RD2) - One potentially significant loss of activity in 2009EL-2071 is in an operon containing homologs of genes involved in anaerobic degradation of ascorbate [69] (Figure 4C). We did not observe a phenotype for growth on ascorbate; however, this is likely due to the use of aerobic growth conditions in our phenotype array experiments.

Integrated Prophages
To determine the nature and identity of potential prophages in our genome sequences, we utilized Phage_Finder, an automated bioinformatic algorithm designed to identify potential integrated prophage sequences [56]. Phage_Finder identifies regions of homology to a curated database of protein sequences and functional domains commonly associated with bacteriophage. The 2011 European outbreak strains, the Georgian strains, and strain 55989 were interrogated using Phage_Finder, which identified a total of 38 putative prophage regions across all five genome sequences and assigned putative left and right termini to each (Table 6). However, to compensate for inherent biases in the algorithm, which had previously caused Phage_Finder to over- or under-call the number of potential prophage-encoded ORFs in 50% of genomes in its training dataset [56], the outputs were curated manually to identify phage-related regions outside of the termini identified by Phage_Finder. Manual inspection revealed several additional regions that had been annotated by RAST as potential genes encoding phage-related functions. Most of the genes missed by Phage_Finder encoded structural functions such as head and tail proteins, but in a few cases lysogeny and integrase functions were missed (not shown). Phage_Finder also identified potential attachment (att) sites of several of the prophage sequences (for details see Table S5 in File S2). Some of these att sites were utilized by different phages in different strains; for example, prophage 55989-1 appears to occupy the same site as prophage B in the stx2a-positive strains.
One potential false-positive region was identified, designated Phage A in Figure 5, which contains few phage-related genes (only a single integrase homolog was found); however, several mobile elements and a putative restriction-modification system were identified within this region, which may account for its being identified by Phage_Finder. Interestingly, prophage region A encodes one of the four SPATE protein homologs present in these strains, suggesting possible transfer of this element into these strains via a mobile element. Once identified in a single strain, phage sequences were utilized as BLASTn queries against the other finished genomes (including TY2482) and against the NCBI database, limiting the database to double-stranded DNA viruses with no RNA intermediate (NCBI Taxonomy ID #35237). The results of the BLAST analyses are shown in Table 7. For queries against the other bacterial strains, we set a cutoff of 95% identity over 95% of the length of the prophage genome for assigning the putative prophage as the same phage. A total of 15 discrete prophages were identified, only one of which (phage H) was common to all five strains (Figure 5A). The stx2a-positive strains had six phages in common (A, D, E, F, G, H), with prophage B missing from 2009EL-2050. The two Georgian strains had replaced prophage C with prophage I; these two phages share extensive homology and may be distantly related variants or mosaics with another phage (File S5). One additional prophage sequence (prophage J) was observed in 2009EL-2071. There are some minor discrepancies in the sizes of the prophages between 2011C-3493 and the previously reported TY2482 sequence. These are likely due to minor misassemblies in the TY2482 genome sequence, which is a draft assembly, in contrast to that of 2011C-3493, which is a finished sequence.
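The 95% identity over 95% of prophage length criterion can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the `Hsp` record, the function names, and the choice to pool identity across alignments and to count overlapping alignments once when computing coverage are all assumptions made for illustration, since the text does not state how multiple BLASTn alignments per query were combined.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """One BLASTn alignment in query (prophage) coordinates, 1-based inclusive.

    Hypothetical record for illustration; field names are not from the study.
    """
    q_start: int
    q_end: int
    identities: int  # identical aligned positions
    align_len: int   # alignment length, gaps included


def merged_coverage(hsps):
    """Count query bases covered by at least one alignment (overlaps counted once)."""
    intervals = sorted(
        (min(h.q_start, h.q_end), max(h.q_start, h.q_end)) for h in hsps
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the running interval: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


def same_phage(hsps, query_len, min_identity=0.95, min_coverage=0.95):
    """Apply the identity and coverage cutoffs to one query/subject pair."""
    if not hsps or query_len <= 0:
        return False
    # Assumption: identity is pooled over all alignments, weighted by length.
    identity = sum(h.identities for h in hsps) / sum(h.align_len for h in hsps)
    coverage = merged_coverage(hsps) / query_len
    return identity >= min_identity and coverage >= min_coverage
```

Under these assumptions, a 40,000 bp prophage matched by two overlapping alignments that together span 39,000 bp at about 99% identity would be called the same phage, whereas a single alignment spanning only 30,000 bp would not, regardless of its identity.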
Most of the other prophage regions identified by Phage_Finder contain a sizable repertoire of structural and non-structural phage genes, and exhibit a variety of architectures, mostly lambdoid in nature (Figure 5B). Two of the putative prophages (E and G) exhibit architectures similar to typical Stx-converting phage [70]; indeed, phage G encodes the stx2 genes. These prophages contain a distinctive gene that encodes a long protein (~2800 amino acids) of unknown function that is common to other stx2 phages [70]. In addition to the Shiga toxin genes themselves, several of the prophages encode potential virulence factors that may be present as phage morons (proteins encoded by phage that play no role in phage replication or structure yet confer upon the host bacterium important evolutionary advantages, such as during virulence). These include homologs of the ail/lom gene family (Phages C, F, G, H, I, J) and the bor genes (Phage G). The ail/lom/bor genes belong to a family of enterobacterial outer membrane proteins expressed by lambda-like phage that confer eukaryotic cell invasion [71] and/or resistance to serum-mediated killing [72,73].

Structural Variation in Phage Regions Including the Shiga-toxin Phage
To determine the degree of divergence of the individual phage sequences within the finished genomes, we extracted the prophage sequences from the whole-genome sequence and aligned the sequences of each prophage individually (File S5). The results of this analysis revealed structural variation in several of the prophages, most notably in the stx2 phage (Figure 6A). Three other phages showed more subtle variations (D, F, and H), which can be ascribed largely to the mobilization of an IS element (D) or to the presence of small deletions (F and H). Phages C and I, although highly similar, were considerably more divergent and suggested a chimeric structure. The effects of the variations on the protein coding sequences were not particularly dramatic for any phage except the Shiga toxin phage (Figure 6A); notably, the 2011C-3493 isolate contains a deletion that spans a portion of a bor homolog, which may be involved in serum complement resistance [72]. Other functions that may be perturbed in the Georgian isolates are an antirepressor protein homolog (antA) and a rha homolog, both of which are disrupted by deletions. Pairwise comparison of the protein sequences encoded by the stx2 phage (Figure 6A, bottom panel; see also File S6) revealed additional effects on protein sequence between 2011C-3493 and the two Georgian strains. The stx2 phages of the Georgian strains were otherwise indistinguishable from each other but for a single synonymous mutation in the stx2A gene. A small region containing prophage-like genes in 2011C-3493 (positions 4128768-4145397 according to the normalized coordinate system in Figure 5 and Table 6) is also present in strain 55989 and TY2482, but is absent from the Georgian strains (Figure 6B). Curiously, this region was missed by Phage_Finder, in spite of the presence of genes encoding putative primase, integrase, and antitermination functions. Phage_Finder misses approximately 10% of known phage sequences, so this result is not surprising [56].
The lack of genes encoding obvious structural proteins suggests that this prophage region might be degenerate.

Phage Induction
Given the extensive repertoire of prophages present in the E. coli outbreak strains, we asked if the prophages are cryptic or active and whether the strains produce viable phage particles that could explain the horizontal acquisition of stx2a phages by an EAggEC strain. It is well known that prophages, including the Shiga toxin-encoding stx2a phage, can be induced by growing the lysogenic strains in the presence of ciprofloxacin, mitomycin C, or by other stimuli [74]. Phage particles were isolated from uninduced and induced cultures and plated on the indicator E. coli strain DH5α. We observed an increase of several orders of magnitude in phage production upon induction with ciprofloxacin and mitomycin C (Figure 6C). The effect of the inducing agents was much more pronounced for mitomycin C, and for isolate 2011C-3493 compared to the 2009 isolates. The culture supernatants were examined by electron microscopy (Figure 6D) for the presence of phage particles. At least two distinct lambdoid phage morphologies could be observed. Both phage morphotypes exhibited an icosahedral capsid. One morphotype exhibited short tails (933W-like) (Figure 6D, panels a and b), while a second morphotype exhibited a typical Siphoviridae-like morphology with long, noncontractile tails (Figure 6D, panels c, e, and f). The 933W-like morphotype was common among the 2009 and 2011 isolates. All of these morphotypes have been observed for prophages, including stx phages, induced from STEC strains [74]. The genomic analysis coupled with the isolation of distinct phage morphotypes indicates that multiple distinct, viable prophages are encoded within the genomes of these strains. Despite repeated attempts, we have thus far been unable to obtain stable stx2 phage lysogens of E. coli K-12 from the 2011 strain; we therefore cannot definitively assign a morphotype to the stx2 phage.

High-throughput Phenotypic Analysis
In order to understand the functional differences between the three O104:H4 isolates, a high-throughput phenotypic characterization was undertaken. We employed OmniLog Phenotypic Microarrays (PMs) and conducted a pair-wise comparison of the strains using the area under the curve (AUC) values that result from measuring the reduction of tetrazolium dye (as an indicator of growth) under the various conditions tested. AUC ratios and P values were calculated for each well in each pair-wise comparison, and those wells that demonstrated a statistically significant two-fold or greater increase or decrease in growth as compared to the parent strain are presented in File S7, along with the chemical name and mode of action. Those wells that exhibited significantly different phenotypes are circled on a heat map display of the overall results in Figure 7. Along the bottom of each heat map panel is a color coding scheme based on the range of values computed for the AUC ratio of test strain versus parent strain; because the range differs between panels, the color representing a given value varies from panel to panel.
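The well-level filter described above (a two-fold or greater change in AUC that is also statistically significant) can be sketched as follows. This is an illustrative sketch rather than the analysis pipeline used here: the function names are hypothetical, the AUC is approximated with the trapezoidal rule, a plain test/parent ratio is used (which may differ from the scale plotted in Figure 7), and the P values are taken as precomputed inputs because the statistical test is not specified in this section. The significance threshold `alpha` is left as a required argument for the same reason.

```python
def auc(times, signal):
    """Trapezoidal area under one well's dye-reduction curve."""
    if len(times) != len(signal) or len(times) < 2:
        raise ValueError("need at least two paired time/signal readings")
    total = 0.0
    for i in range(1, len(times)):
        width = times[i] - times[i - 1]
        total += width * (signal[i] + signal[i - 1]) / 2.0
    return total


def flag_wells(test_auc, parent_auc, p_values, alpha, min_fold=2.0):
    """Return {well: AUC ratio} for wells passing both filters.

    test_auc, parent_auc, p_values: dicts keyed by well ID (e.g. "A01").
    alpha: significance threshold chosen by the caller (an assumption here).
    """
    flagged = {}
    for well, parent in parent_auc.items():
        if parent <= 0:
            # No measurable growth in the parent strain; ratio is undefined.
            continue
        ratio = test_auc[well] / parent
        changed = ratio >= min_fold or ratio <= 1.0 / min_fold
        if changed and p_values[well] < alpha:
            flagged[well] = ratio
    return flagged
```

Keeping the fold-change and significance tests as separate conditions mirrors the two criteria named in the text, so a well with a large but non-significant change, or a significant but small change, is not reported.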
From this analysis, several trends were observed. Overall, the two Georgian isolates were found to be more similar to each other phenotypically than to the 2011 isolate, as evidenced by the limited range of the AUC ratios for this pair-wise comparison (−3.3 to +1.7) versus the other two comparisons (which had ranges of −7.9 to +2.7 and −7.1 to +1.9). Of the differences that were found amongst the three isolates, most were found in PM11-20, which assay for growth in the presence of various antimicrobial compounds. These results were consistent with the genomic data. Isolate 2011C-3493 was found to be more resistant to cephalosporin (encircled by blue solid ovals) and beta-lactam (blue dashed ovals) antibiotics than either of the two Georgian isolates, probably due to the presence of the large IncI1 plasmid that encodes the CTX-M-15 cephalosporinase. Additionally, the Georgian isolate 2009EL-2071 is more sensitive to tetracyclines (yellow ovals) and less sensitive to chelating agents (black ovals) than the other two strains, consistent with the deletion of the mer/tet locus.

Discussion
The severity of the 2011 outbreak centered in northern Germany and the high rate of progression to HUS among infected patients indicated that the O104:H4/HUSEC041 clade of StxEAggEC strains may represent a significant new threat to public health. As a function of ongoing biological and public health engagement and biosurveillance by the Georgian Centers for Disease Control and Public Health and their partner agencies in the United States, the analysis of several previously uncharacterized O104:H4 strains from a relatively unheralded cluster of cases in 2009 was undertaken. The analysis presented here reveals previously unreported genetic diversity among StxEAggEC strains and suggests that multiple lineages of such strains may currently be circulating worldwide. While this diversity was previously suggested by comparisons of gene content and optical maps between the 2011 and 2009 outbreak strains [66], the whole-genome sequences of these strains provide a high level of resolution and unambiguous placement for these genetic acquisitions and losses. Our results concur with those presented in a recent description of virulence factors present in a Georgian O104:H4 stx2-positive strain [75]. The strains of the European outbreak were shown to be clonal in a recent genomic epidemiology study [65]. While 2011C-3493 unquestionably belongs within the clonal group that includes the German isolates, the presence in this isolate of previously unreported SNPs relative to the TY2482 genome strongly suggests that at least some previously unsampled diversity is present within the German isolate group, although it is not clear at this time whether these two mutations were present in the population that seeded the outbreak or whether they represent the products of in-host evolution.
Considerable diversity is observed in the prophage component of these strains. This is not surprising, given the highly mobile nature of phage genomes and the prominent roles of phage in transferring genetic material between bacterial strains [76]. These differences are particularly evident in the related C and I prophages and in the variant stx2 prophage. The divergence of the stx2 phage in these strains strongly suggests the possibility that two separate stx2 phage acquisition events may have contributed to the emergence of these strains (Figure 8). stx2 phage can exhibit wide genetic diversity [77] and highly mosaic phage genome structures suggestive of frequent recombination between phage variants [70,78]. While the origins of these particular stx2 phages are not clear, their inducibility is similar to that of other phage previously reported in stx2 phage-containing E. coli strains, and therefore these and related phages may be exchanging freely in the environment. The discovery of differences in the lysogenic stx2 phage between these strains may provide a clue to the high apparent pathogenicity of the 2011 outbreak strain, which is supported by the recent demonstration of higher inducibility of the Stx2 toxins and mRNA of the 2011 strains relative to O157:H7 strains in the presence of antibiotics [79]. While the effect on toxin production of each of the lysogenic stx2 phage in each of the isolates is not clear at this time, a previous study by Wagner and co-workers of the effect of phage genotype on toxin production using isogenic host strains harboring diverse stx2 phage yielded a broad range of toxin production levels. These differences were especially evident in uninduced (antibiotic-free) cultures [80].
While the clinical profile of the 2011 and 2009 strains appears similar, the acquisition of the IncI1 plasmid containing a broad-spectrum cephalosporinase differentiates the 2011 strains from the 2009 and HUSEC041 strains [29,37,64]. The mobility of these plasmids and their worldwide distribution highlight the concern over the acquisition of multi-drug resistance determinants by highly pathogenic strains. A very similar CTX-M-15-positive IncI1 plasmid was recently discovered in an isolate of Shigella sonnei in 2006 [81]. The presence of circulating CTX-M-15-containing plasmids worldwide [82], as well as other, even broader-spectrum beta-lactamase enzymes such as blaNDM-1 [83,84], offers ample opportunity for this or similar multidrug-resistance plasmids to enter previously sensitive O104:H4 strains. In addition to the diversity in plasmid content, we also observed differences in the prophage content of the 2009 and 2011 outbreak isolates.
Figure 7. Pair-wise heat map phenotypic comparison of three E. coli strains. Each strain was assayed for growth in the presence of various chemicals using OmniLog phenotypic microarrays, as detailed in Materials and Methods. Each well represents the average of three biological replicates. The columns represent the well position, and are denoted as Ai to Hi (i = 1 to 12) from the left to the right of the plot in each array along the x-axis (note that there were no values for wells H12). Each cell ratio value represents the average of three biological replicates. Plates PM01-PM10 contain single wells for each growth condition, while plates PM11-PM20 contain quadruplicate wells for each growth condition. Those wells which exhibited a two-fold or greater difference in growth and which were statistically significant, with P value less than 0.
While the exact roles of the divergent prophages in these isolates are not clear, several studies indicate that phage lysogeny can affect phenotypes of host strains in unexpected ways [85,86,87,88]. Notably, the divergent phages between the Georgian and German isolates harbor different phage-encoded virulence factors, particularly of the ail/lom/bor family. These may contribute in unexpected ways to the phenotype(s) of these pathogens in vivo, although their exact roles, if any, are not known at this time. Finally, questions remain about the infection mode and relative virulence of each strain. Establishment of an animal model that accurately reflects the human disease profile will be critical in future experiments for studying this class of highly virulent E. coli pathogens in greater depth.
Our multiphenotypic analysis revealed an unexpected trait in both of the Georgian strains, namely an increased relative resistance to polymyxin B and other membrane-disrupting agents. The basis for this resistance is not clear from the genotypes of these strains, nor is it immediately obvious whether this is a trait that was gained by the Georgian strains or lost by the 2011 isolate. While no obvious mutations (e.g. in the phoPQ- or pmrAB-regulated genes involved in regulated lipid A modification) were found in the datasets to which this phenotype could be attributed, regulation of membrane modification processes in E. coli is complex and highly dependent on growth conditions [89,90]. One or more of the 2009EL-2071-specific mutations or resident prophages may contribute to this phenotype, but at this time its genetic basis remains unclear. Given that polymyxins, particularly colistin, can serve as last-line antimicrobial agents for extensively drug-resistant enterobacterial strains, particularly those that express extended-spectrum beta-lactamases, the discovery of potential emerging resistance in this highly pathogenic lineage is a concern.
Establishment and maintenance of robust global biosurveillance networks, with special emphasis on public health and disease monitoring systems and policies, will be critical in the future to identify emerging disease threats such as the strains described in this study. Accurate characterization of such isolates will be increasingly important, and sequence information and comparisons can rapidly be generated both at large genome centers and using "crowd-sourcing" methodologies [38]. Finally, as the technology supporting small, more portable sequencing platforms matures, it will become possible to conduct whole-genome analysis closer to the point of care during outbreaks, enabling true real-time application of genomic information to the characterization and management of ongoing disease outbreaks.
File S7. Omnilog data summaries (separate tabs for average AUC values for each strain; unfiltered fold-change/significance; and filtered fold-change/significance). (XLSX)