Virulence gene profiles and phylogeny of Shiga toxin-positive Escherichia coli strains isolated from FDA regulated foods during 2010-2017

Illnesses caused by Shiga toxin-producing Escherichia coli (STECs) can be life threatening, such as hemolytic uremic syndrome (HUS). The STECs most frequently identified by USDA’s Microbiological Data Program (MDP) carried toxin gene subtypes stx1a and/or stx2a. Here we described the genome sequences of 331 STECs isolated from foods regulated by the FDA 2010–2017, and determined their genomic identity, serotype, sequence type, virulence potential, and prevalence of antimicrobial resistance. Isolates were selected from the MDP archive, routine food testing by FDA field labs (ORA), and food testing by a contract company. Only 276 (83%) strains were confirmed as STECs by in silico analysis. Foods from which STECs were recovered included cilantro (6%), spinach (25%), lettuce (11%), and flour (9%). Phylogenetic analysis using core genome MLST revealed these STEC genomes were highly variable, with some clustering associated with ST types and serotypes. We detected 95 different sequence types (ST); several ST were previously associated with HUS: ST21 and ST29 (O26:H11), ST11 (O157:H7), ST33 (O91:H14), ST17 (O103:H2), and ST16 (O111:H-). in silico virulome analyses showed ~ 51% of these strains were potentially pathogenic [besides stx gene they also carried eae (25%) or 26% saa (26%)]. Virulence gene prevalence was also determined: stx1 only (19%); stx2 only (66%); and stx1/sxt2 (15%). Our data form a new WGS dataset that can be used to support food safety investigations and monitor the recurrence/emergence of E. coli in foods.

In order to cause illness, STEC strains need a set of genes that allow them to attach, colonize, and produce and secrete Shiga toxin protein [9][10][11][12]. STECs have the capacity to produce attaching and effacing (A/E) lesions on intestinal mucosa, mediated by proteins found on the locus of enterocyte effacement (LEE) Pathogenicity Island. These include EAE protein (intimin, coded by the eae gene), secreted effector proteins (Esp), a TIR receptor, and other T3SS effectors present in the LEE island. However, some STECs do not carry eae (LEE-negative STEC strains). They possess other genes believed to compensate for the lack of eae or LEE island (e.g. saa) and still caused sporadic cases of HUS [13,14]. Other putative virulence genes (ehxA, espP, etpD, toxB, katP, subA, saa, and sab genes, among others) are usually located in a plasmid referred to as the virulence plasmid to differentiate it from other possible plasmids that can be carried by the same strain [9,11,15]. All STECs have a virulence plasmid, although differing in sizes and gene content (either virulence or plasmid maintenance genes). Although the precise role of ehxA in STEC pathogenesis remains to be elucidated, several studies indicate an association of ehxA in clinical disease since 1) ehxA was found to be produced by many STEC associated with diarrheal disease and HUS [16][17][18], and 2) serum samples from HUS patients have been shown to react specifically to ehxA [19].
Beyond the noted outbreaks, there have been several reports on STECs found in FRFDA [22,23]. The most comprehensive survey was the USDA Microbiological Data Program (MDP) that collected domestic and imported fresh fruit and vegetable samples from primarily terminal markets and wholesale distribution centers from 2001-2012 (https://www.ams.usda. gov/datasets/mdp/mdp-program-data-and-reports). This program tested approximately 15,000 samples annually, and tested for the presence of Salmonella, E. coli O157:H7, and other resultant DNA extract was stored at -20˚C until used as a template for whole genome sequencing. The concentration was determined using a Qubit double-stranded DNA HS assay kit and a Qubit 2.0 fluorometer (Thermo Fisher Scientific, Waltham, MA), according to the manufacturer's instructions.

Whole genome sequencing, contig assembly and annotation
The genomes of the strains sequenced in our laboratory used an Illumina MiSeq sequencer (Illumina, San Diego, CA), with the 2x250 bp pair-end chemistry according to the manufacturer's instructions, at approximately 80X average coverage. The genome libraries were constructed using the Nextera XT DNA sample prep kit (Illumina). Genomic sequence contigs were de novo assembled using default settings within CLC Genomics Workbench v9.5.2 (QIA-GEN) with a minimum contig size threshold of 500 bp in length.

in silico serotyping, determination of virulence genes, and antimicrobial resistance genes identification
We used Ridom software for performing batch screening of determinants for each of the sequenced genomes.
The serotype of each strain analyzed in this study was confirmed using the genes deposited in the Center for Genomic Epidemiology (http://www.genomicepidemiology.org) for E. coli as part of their web-based serotyping tool (SerotypeFinder 1.1 -https://cge.cbs.dtu.dk/services/ SerotypeFinder) [37,37,38]. Each whole genome sequence was screened for O-type or H-type genes. The virulence genes present in each strain were determined using the genes deposited in the Center for Genomic Epidemiology (http://www.genomicepidemiology.org) for E. coli as part of their VirulenceFinder 1.5 web-based tool (https://cge.cbs.dtu.dk/services/VirulenceFind er) [38]. Each WGS was screened for the presence of 102 virulence genes (95 virulence genes previously reported here [27] plus 7 additional genes) (S2 Table). These 102 virulence genes included a repertoire representing genes found in different E. coli pathotypes (ETEC, STEC, EAEC, and EPEC) in order to detect any possible E. coli hybrid present. Antimicrobial resistance genes present in each sequenced genomes were identified by using the genes deposited in the Center for Genomic Epidemiology (http://www.genomicepidemiology.org) as part of their Resfinder 2.1 web-based tool (https://cge.cbs.dtu.dk/services/ResFinder) [39]. Each WGS was screened for the presence of those AMR genes reported in the database.

Phylogenetic relationship of the strains by cgMLST analysis
The phylogenetic relationship of the strains was assessed by a core genome multilocus sequence typing (cgMLST) analysis using Ridom SeqSphere+ software v2.4.0. The genome of O157:H7 strain Sakai (NC_002695.1) was used as the reference for the cgMLST. This E. coli strain has 5,204 genes, of which 3,860 genes (core genes) were present in the six genomes used as comparison to generate the cgMLST scheme (NC_011353.1 -O157:H7 strain EC4115, NC_002655.2-O157:H7 strain EDL933, NC_013008.1 -O157:H7 strain TW14359, NC_013941.1-O55:H7 strain CB9615, NC_017656.1-O55:H7 strain RM12579, and NC_017906.1 -O157:H7 strain Xuzhou21). While 791 genes were found in some of the compared genomes, the remaining of the genes were eliminated from the analysis since they were paralogous or pseudogenes. Therefore, a total of 4,651 genes were used as templates for the analysis of the STECs from this study. After eliminating loci that were missing from the genome of any strain used in our analyses, we performed a cgMLST analysis. These remaining loci were considered the core genome shared by the analyzed strains. We used Nei's DNA distance method [40] for calculating the matrix of genetic distance, taking into consideration only the number of same/different alleles in the core genes. A Neighbor-Joining (NJ) tree using the appropriate genetic distances was built after the cgMLST analysis. The discriminatory index was calculated with the Ridom software using the Simpson's discriminatory index as described [41]; cgMLST uses the alleles number of each loci for determining the genetic distance and builds the phylogenetic tree. The use of allele numbers reduces the influence of recombination in the dataset studied and allows for fast clustering determination of genomes.

Nucleotide sequence accession numbers
The draft genome sequences of 331 E. coli strains used in our study are available in GenBank under the accession numbers listed in S1 Table.

Presence of STEC in FDA regulated foods
Among 331 suspected STEC strains isolated from FRFDA between 2003-2017 and sequenced by several labs and deposited at NCBI, only 276 were confirmed to be STECs by in silico analysis for the presence of either stx1 or stx2 (S1 Table). Those that were stx negatives either might have lost their phages or were initially misidentified. Of the 196 identified by MDP and sequenced by our lab, 92% carried either stx1 or 2 (181/196). Of the 74 strains which genomes were retrieved from NCBI and were initially isolated and sequenced by FDA ORA, 94% carried either stx1 or 2 (70/74). Of the 63 E. coli strains isolated and sequenced by a FDA contracting laboratory, 43% carried either stx1 or 2 (25/61). The frequency of isolation of STECs from foods are listed in Table 1. STECs were isolated from 22 food commodities. Most of these STECs were isolated from spinach (32%), flour (21%), lettuce (13%), and cilantro (12%).
The frequency of STEC isolation per year, their sequence type, food commodity and state of isolation (if available) is listed in Table 2. The minimum number of STECs recovered from FDA regulated foods per year during the period 2010-2017 was 12 in 2012 and the maximum was 53 in 2016, for a median of 30 STECs per year. We used 2010 as our starting year, since the number of strains prior to that year were found sporadically and came from our STEC historical collection. The STEC analyzed strains were isolated in 22 states.

Characterization of STEC strains by in silico MLST
Among the 276 STECs analyzed in this study we identified 95 different sequence types (STs) by in silico MLST. Strains belonging to a ST were isolated between 1 to 12 times in the period studied (2010-2017) ( Tables 3 and S3). The majority of the STs identified were observed only one time (45%).

Characterization of STEC strains by serotyping, and virulence gene profiles
However, a strain belonging to a known ST that caused hemorrhagic colitis (HC) is not enough to predict the likelihood of it causing disease illness. Therefore, we further characterized these STECs by in silico virulence determination and their predicted serotype ( Table 4). The detailed in silico analysis for presence of virulence genes and serotype is listed in S2 Table.  Table 4 lists only the serotype and some of the most known virulence genes for each strain: stx1 variants, stx2 variants, eae variants, ehxA, espP, etpD, toxB, katP, subA, saa, and sab. We identified at least 81 different serotypes among the 276 STECs sequenced ( Table 4). Many of the O types were not present in our O types database and were listed as unknown or the wzx and wzy gene were not assembled. Among those some of the most common clinical STECs serotypes were identified, such as: O157:H7, O26:H11, O113:H21, O121:H19, O91:H21, O103: H2, and O111:H8. Other serotypes like O153/O178:H19 could not be discerned by in silico analysis since both O types have identical wzx and wzy gene sequences. This issue is discussed in more detail elsewhere [37].

Phylogenetic relationship of the STEC strains by cgMLST analysis
The phylogenetic relationships among the 276 STECs from this study determined by cgMLST analysis is shown in Fig 1. The genome of O157:H7 strain Sakai (NC_002695.1) was used as the reference for the cgMLST. The initial phylogenetic analysis [Neighbor-Joining (NJ) tree] based on gene differences (allele based) among these 276 STECs (Fig 1) revealed a complex evolutionary history with the existence of multiple, highly diverse genomic variants of strains isolated from RFFDA. Some of these genomes formed discrete groups and clustering was consistent with their ST (ex. all ST655 strains clustered together). A further analysis by a minimum  spanning tree allows visualization of allele differences between strains with the same ST that was not seen with the NJ tree (Fig 2).

eae positive Non-STEC strains virulence gene profiles
Among the 55 non-STECs (lacking either stx gene by in silico analysis) strains isolated from FDA regulated foods, we found 35 that were positive for the eae gene (S4 Table). The majority were classified as atypical EPEC (aEPEC) eae + and bfpA -. Two of them (IEH-NGS-ECO-00094, and IEH-NGS-ECO-00100) carried bfpA -the first gene of the bfp operon that encodes a type IV pilus-(eae + and bfpA + ) but were missing most of the common genes found in typical EPEC (S4 Table, typical EPEC lineage 1 strain E2348/69). Therefore, we classified them as aEPEC.

Discussion
STECs are the most dangerous among diarrheagenic E. coli affecting public health worldwide [5,7,23,27,42,43]. Usually the most threatening STEC are those of O157:H7 serotype [44,45]. However, in recent years there has been an increase in the occurrence of many non-O157 serotypes in humans associated with consumption of contaminated food, including produce and other FDA regulated products [2,46,47]. Some studies have characterized STECs presence and their virulence potential from FDA regulated products [20,23,48]. Most of the STEC isolated from those products have been only initially screened for the presence of some virulence genes using PCR [23,49]. In the present study, we performed an in-depth analysis with whole genome sequencing of 331 presumptive STEC strains. These strains were isolated from FDA regulated foods recovered during a period of 2003-2017 by two surveillance programs (FDA ORA, and MDP USDA). STECs were isolated from 22 food commodities. It is worth mentioning that even though the sampling did not occur in all states, the food commodities had nationwide (or at least multistate) distribution.

CFSAN053338
-  [69/95-73%] had been reported as causing disease in humans. Furthermore, of these potential human pathogenic STECs, strains belonging to 18 of those STs (19%) were additionally associated with strains causing EHEC-related illnesses (Table 3). Among the known ST associated with causing HC illnesses or HUS cases we found: ST21 and ST29 (O26:H11), ST11 (O157:H7), ST33 (O91:H14), ST17 (O103:H2), and ST16 (O111:H-), among others [7,42,50,51]. There are some STs, that have been described as EAEC/ETEC, however in our dataset, the strains with those STs did not carry any virulence genes for those pathotypes. These results emphasized the need to analyze the data as a whole (e.g. to include virulence genes);ST alone is not an accurate predictor of pathotype or virulence.
Although some strains have the same ST, they show differences in their virulence profile as well as their Shiga toxin gene content. For example, there were 5 strains that were ST10 and from these only 2 were classified as STECs, with one carrying stx1a while the other carried stx2a. Both were negative for any of the attaching genes (eae, or saa genes), and therefore considered low risk for causing infection in a healthy individual. This demonstrates that a single characteristic (e.g. ST or serotype) is not enough to make an inference of the potential pathogenic trait of any STECs (http://www.fao.org/documents/card/en/c/CA0032EN). The better way is to take all the information into consideration (ST, stx type, attaching genes, serotype, etc) in order to make a more informed prediction of the pathogenic potential of any STEC in conjunction with historical available data on clinical cases. For example, a strain of O113:H21 stx2a positive that doesn't possess eae but has saa (a gene that encodes an auto-agglutinating adhesion) and has been linked to HUS cases [14], could be potentially harmful to humans. A similar analysis could be done in the case of any STEC that has all those attributes but has not been linked to any human cases. Even though we cannot predict the actual outcome of an infection with this strain, it still warrants a warning about its presence in foods that are consumed raw as is the case with fresh produce.
We tested for 95 known virulence genes [27] found in the most common E. coli pathotypes and did not find any genes present that would characterize the strains as STEC/EAEC//ETEC/ EIEC hybrids. Among the adherence factors, eae and saa genes were found in 24% and 26% of the STEC strains, respectively. Strains that carry eae did not carry saa, and vice versa, as perviously observed for STEC isolated from fresh produce [23]. Regarding the presence of Shiga toxin type, there was great variation with most strains (67%) carrying only stx type 2, 19% carrying only stx type 1 while 15% carried both stx types. Among the stx2 there were 144 that were either a,d, or c variants, which are the stx2 variants found among clinical cases [31,32,[52][53][54] and that have a specific tropism for humans [53]. The remaining 40 STECs carrying stx type 2 alone were stx variants e, d/e and g which have been found in animal reservoirs [55]. The remaining putative virulence genes were sporadically found with the most common exhA gene found in 61% of the STECs, while espP was found in 43% of the STECs. These two genes can be found in the virulence plasmid and appear to participate in STECs infection in humans [9,11,15]. In summary, 46 of the STECs analyzed in this study carried a combination of stx2a/d and eae genes. Many researchers and food safety institutions considered a STEC with this virulence gene combination to possess an elevated risk to humans [22,53]. On the other hand, 94 of these STECs carried a combination of stx2a/d/c and saa genes. In this case, the risk is harder to determine and are considered of moderate risk if serotype was previously found associated with human cases. We also confirmed the presence of antimicrobial resistance (AMR) genes in some of the analyzed STEC strains with a low prevalence (12%). However, those few strains carried multiple antimicrobial resistance genes. The presence of strains carrying multiple AMR genes is worrisome since they can be shared amongst other E. coli and could possible participate in the dissemination of AMR in their environments, as has been observed for tetracycline genes in E. coli isolates from beef cattle [56], for colistin resistance (mcr-1 gene) through plasmid-mediated transfer [57], and for ampicillin resistance genes in E. coli in an infant treated with antibiotics [58].
Phylogenetic analysis by a custom cgMLST analysis of these 276 STECs confirmed the MLST in silico analysis, with many different defined clades among these STECs isolated from FRFDA. The cgMLST analysis is a fast method of analysis and provides an initial visualization of the relationships among the strains analyzed. Comparable results have been observed for establishing fast relationships among genomes from diverse bacterial pathogens [27,29,[59][60][61][62][63][64]. A further analysis using only the genomes of strains that are located within each individual or among selected clades can produce a more detailed evolutionary history, using single nucleotide analyses. This SNP analysis can help in determining the potential source, phylogenetic nature, lineage, and timeline of transmission of each group, as has been shown for the ST36 lineage of Vibrio parahaemolyticus [65].
EPECs are the leading cause of infantile diarrhea in developing countries [66,67]. EPEC strains do not carry stx genes but typical EPECs (tEPEC) have eae and bfp genes, and their main reservoir is humans [68]. The eae gene is located in the chromosome, in the LEE operon, while the bfp operon is typically located in the large EPEC adherence factor (EAF) virulence plasmid [68]. These tEPEC also carry the perA gene, which increases the expression of LEE elements [68,69]. Interestingly, 11% of our presumptive STECs were shown by in silico analysis to be atypical EPECs (aEPEC). Among their unusual features are the absence of the EAF plasmid, and their reservoirs can be animals or humans [68]. It is possible these aEPECs might had lost their phages upon culturing, as this pattern has been observed in clinical isolates of E. coli upon sub-cultivation [70]. The aEPEC we observed in this study may have the capacity to produce A/E lesions, since they carried both the eae and tir gene, which are the effector and receptor necessary for the formation of the A/E lesion [71].
Our results suggest that finding aEPECs in food could be of particular concern, as these strains have the potential for acquiring the stx phage, as observed in the E. coli O104:H4 strain found in Germany [72]. That strain was an entero-aggregative E. coli (EAEC) that had acquired an stx2a phage, and human illnesses that resulted during 2011 became the largest known HUS outbreak of STEC-related illness in the world [72]. Similarly, an O26:H11 strain 21765, isolated in 2005 during a milk cheese outbreak in France [73] was shown to be an EPEC strain that had probably acquired a stx2a phage [27]. In Gonzalez-Escalona et al (2016), the authors demonstrated that some strains of E. coli O26:H11 isolated from US cattle were phylogenetically more closely related to ST29 O26:H11 EHECs but t because these did not carry the stx phage, they would have been classified as EHEC-like by previous methods [74]. Over the last five years, the analyses of thousands of E. coli genomes have revealed that socalled E. coli "hybrid strains"-strains that belong to one pathotype but acquire virulence markers, such as stx genes, from another pathotype-could be more common than previously believed.
We are heading to a new phase in surveillance of STECs in the US by using a genomic monitoring approach and the genomes of the STEC isolated from FRFDA provides a solid foundation to build upon [75] (https://www.cdc.gov/pulsenet/pathogens/wgs.html). A database already exists that achieves the first goal of source tracking by using core genome information (NCBI pathogen detection tool). However, there is a need for improved databases that allow for fast analysis of the WGS data for detecting virulence genes, phages and plasmids content, as well as antimicrobial resistance genes.
In conclusion, STECs were isolated from diverse FRFDA food sources during the period study. The contamination frequency was relatively low (median 30 STEC strains isolated per year). However, fifty percent of the STECs analyzed in this study carried either a combination of eae plus stx, or saa plus stx, therefore being potentially pathogenic to humans. Moreover, those STECs carried most of the virulence genes described for STECs causing infections with a diverse range from HC (e.g. ST655 O111:H19 strains) to HUS (e.g. ST21 O26:H11 strains) [20,42]. Some others have not been described as causing disease in humans but have the potential to do so (e.g. ST342 O5:H-unknown strains) since they carried all virulence genes described in pathogenic strains (stx1a, eae-beta1, exhA, tir, and many of the T3SS effectors and non-LEE effectors) (Table 4). Nonetheless, the determination of the presence of STECs in FRFDA with the potential to cause disease in humans reinforces the need to continue surveillance of this important pathogen which is of importance for food safety and public health. Furthermore, the availability of these genomes could provide early warnings of food contamination from cattle or other animals, since some of the STEC isolated were carrying stx2e that have been usually observed causing edema in pigs [76] and are considered as probably non-pathogenic to humans [53]. Here we showed that WGS enabled comparisons across isolates to establish phylogeny, helped in identification of antibiotic resistance by monitoring the presence of antimicrobial resistance genes, and determined the presence of known virulence genes that have been linked with illnesses. A freely accessible dataset of high-quality reference genome sequences of FRFDA was previously unavailable. Future food safety investigations will benefit from the comparisons made possible by this WGS dataset as it allows for the monitoring of the recurrence and emergence of strains in the food supply. It is our goal to help develop a database that will allow for fast source tracking and accurate categorization (low risk or high risk) of STEC food isolates in a more comprehensive manner.
Supporting information S1  Laboratories who diligently isolated and sequenced many of the isolates here within analyzed. We also want to thank Sabina Lindley for assistance in sequencing some of these E. coli strains and Lili Fox Vélez for her helpful comments on this manuscript.