Comparative genomic analysis and molecular examination of the diversity of enterotoxigenic Escherichia coli isolates from Chile

Enterotoxigenic Escherichia coli (ETEC) is one of the most common diarrheal pathogens in the low- and middle-income regions of the world, however a systematic examination of the genomic content of isolates from Chile has not yet been undertaken. Whole genome sequencing and comparative analysis of a collection of 125 ETEC isolates from three geographic locations in Chile, allowed the interrogation of phylogenomic groups, sequence types and genes specific to isolates from the different geographic locations. A total of 80.8% (101/125) of the ETEC isolates were identified in E. coli phylogroup A, 15.2% (19/125) in phylogroup B, and 4.0% (5/125) in phylogroup E. The over-representation of genomes in phylogroup A was significantly different from other global ETEC genomic studies. The Chilean ETEC isolates could be further subdivided into sub-clades similar to previously defined global ETEC reference lineages that had conserved multi-locus sequence types and toxin profiles. Comparison of the gene content of the Chilean ETEC identified genes that were unique based on geographic location within Chile, phylogenomic classifications or sequence type. Completion of a limited number of genomes provided insight into the ETEC plasmid content, which is conserved in some phylogenomic groups and not conserved in others. These findings suggest that the Chilean ETEC isolates contain unique virulence factor combinations and genomic content compared to global reference ETEC isolates.


Introduction
The pathogenic variant (pathovar) of Escherichia coli known as enterotoxigenic E. coli (ETEC) has been implicated in 1 billion cases of diarrhea annually [1], and recent studies, such as the Global Enteric Multicenter Study (GEMS), has further confirmed that ETEC is a significant global pathogen [2]. ETEC are defined on a molecular basis by the presence of genes that encode the heat-stable (ST) and/or heat-labile (LT) enterotoxin [3,4]. Although a number of virulence factors identified in ETEC isolates have been investigated as potential vaccine antigens, to date no effective vaccine has been developed most likely due to antigenic variation of key virulence factors [5][6][7][8]. One goal of microbial genomic studies is to identify novel antigens that may be used as alternate vaccine targets [9,10]; however characterization of the genomic and antigenic diversity of these pathogens is required prior to selecting these novel antigenic targets. Genome sequencing of ETEC isolates to date has revealed significant diversity not previously observed in the other E. coli pathovars [11][12][13][14][15][16][17]; however, previous studies have not extensively characterized the genomic diversity of Chilean ETEC isolates.
While the enterotoxins are the defining molecular and virulence features of ETEC, the majority of isolates also express one or more colonization factors (CFs), a heterologous group of antigens that promote attachment to intestinal epithelia [18,19]. Most ETEC-specific virulence factors, including the CFs are plasmid-encoded. At least 27 known or putative CFs have been described in the literature [18,19] and more CFs have been identified in genomic studies [20], but have yet to be functionally characterized [13,21]. In addition to the canonical ETEC virulence factors and CFs, additional virulence factors have been identified with variable presence in ETEC isolates. Two plasmid-borne loci encode EatA, a secreted serine protease autotransporter [22], and EtpA, a glycoprotein that acts as an adhesive bridge between ETEC flagella and host surface structures [23], seem to be more broadly conserved in ETEC than previously identified putative virulence factors [23,24]. While functional characterization of the contribution of EatA and EtpA ETEC virulence is ongoing, EatA appears to accelerate toxin delivery by degrading MUC2, the major mucin secreted by gastrointestinal goblet cells [25], and EtpA modulates adhesion to blood group antigens [26,27]. Additionally, both EtpA and EatA are immunogenic in humans [28] and have been demonstrated to be protective antigens in animal models [25,29,30].
The advent of whole genome sequencing has opened the possibility of examining large numbers of ETEC genomes from isolates collected over time from different geographic locations to increase our understanding of the dynamic nature of these organisms. Until 2014, there were fewer than 10 sequenced and assembled human-associated ETEC isolates available in Genbank, and all were from symptomatic patients, [11,12,14]. However, recent studies by von Mentzer et al. [15], Del Canto et al. [21] and Sahl et al. [13] as well as others [16,17,31] have greatly expanded the genomic knowledge of ETEC as a pathovar, by including ETEC from various global geographic locations and clinical presentations. While there have been a number of molecular studies of the ETEC in Chile [24,[32][33][34][35][36], only a limited number of genomes of Chilean ETEC isolates have been examined [21]. The current study examines a collection of 125 diarrhea associated ETEC isolates from three geographic locations in Chile to begin to address the gap in our knowledge of the genomic and virulence factor diversity of ETEC in Chile. This study also takes advantage of the ability to compare traditional PCR and other typing assays with in silico analyses based on the whole genome sequencing.

Bacterial isolates and media
The ETEC isolates examined in this study were obtained via targeted ETEC surveillance at three locations in Chile: Santiago (88 isolates), Antofagasta (31 isolates) and Calama (6 isolates). All isolates were determined by PCR to be either ETEC LT or ST positive using a previously validated PCR assays [37]. The isolates were grown, with minimal passage, in Lysogeny Broth (LB)[38](Difco) for genomic DNA isolation and propagation.

Genome sequences
The genomes of the isolates were generated, sequenced and analyzed as previously described [39]. The 150bp sequencing reads from the Illumina platform [39] were assembled using spades v.3.7.1 with careful mismatch correction [40] and the assemblies were filtered to contain only contigs �500bp with �5X k-mer coverage [41]. The assemblies were further examined for characteristics that would suggest the genome was of high quality (<400 contigs) and potentially E. coli (%GC~50% and genome size between 4.7-5.4 Mb). The assembly metrics and corresponding GenBank accession numbers are provided in S1 Table. Four isolates were further sequenced with Pacific Biosciences (PacBio) to complete the genomes as representatives of isolates from Chile, as well as to completement the existing genome references. PacBio raw data was corrected and assembled as previously described [42,43] The final assembly statistics for these genomes are included in S1 Table.

Multilocus sequence typing and serotype identification
The seven loci (adk, gyrB, fumC, icd, mdh, purA, and recA) of the multilocus sequence typing (MLST) scheme developed by Wirth et al. [44] were identified and compared with the database maintained by the University of Warwick (http://mlst.warwick.ac.uk/mlst/dbs/Ecoli). The MLST gene sequences extracted from each genome were used to query the BIGSdb database [45] to obtain the allele numbers and sequence type of each ETEC genome analyzed. In silico serotype identification was performed on the assembled genomes using the online Serotype-Finder 1.1 (https://cge.cbs.dtu.dk/services/SerotypeFinder/).

Phylogenomic analysis
The genomes of the ETEC isolates analyzed in this study were compared with 73 previously sequenced E. coli and Shigella genomes (S2 Table) using the In Silico Genotyper (ISG) [46,47] as previously described [41][42][43]. Single nucleotide polymorphisms (SNPs) were detected relative to the completed genome sequence of the phylogroup F laboratory isolate E. coli IAI39 (NC_011750.1). A total of 221,978 conserved SNP sites, which were present in all of the genomes analyzed, were concatenated into a representative sequence for each genome. A maximum-likelihood phylogeny was inferred using the GTR model of nucleotide substitution with the GAMMA model of rate heterogeneity, and 100 bootstrap replicates, and visualized using FigTree v1.4.2 (http://tree.bio.ed.ac.uk/software/figtree/).

Comparisons of genome content
The genomes of the 73 reference E. coli isolates and the 125 Chilean ETEC isolates (S1 Table) were compared de novo using Large Scale-BLAST Score Ratio (LS-BSR) analysis [39,48]. The resulting output was used to identify genes that exhibited an altered distribution among the isolates that were examined. The LS-BSR values and the nucleotide sequences of each gene cluster for the 125 new ETEC Chilean isolates are included in Supplemental Data Set 1.
The presence or absence of the ETEC virulence genes were examined using BLAST Score Ratio [49] as previously described [13,39,41,50,51]. The genes that were detected with a BSR �0.8 were considered to be present in a genome as previously described [13,39,41,50,51].

Phylogenetic analysis of ST genes
The ST genes from each ETEC genome were compared with previously described estA reference sequences [52]. The estA nucleotide sequences were aligned using ClustalW and a phylogeny was constructed using the maximum-likelihood method with the Kimura 2-parameter model and 1,000 bootstraps using MEGA7 [53].

In silico detection of ETEC virulence plasmids
Plasmids in each of the complete genomes were annotated using an in-house annotation pipeline with gene prediction using Prodigal [54][55][56]. The predicted protein-coding genes of select plasmids were detected in each of the ETEC genomes using BLASTN LS-BSR. Heat maps were generated using the heatmap2 function of gplots v. 3.0.1 in R v.3.3.2 and were clustered using the complete linkage method with Euclidean distance estimation.

Identification of EtpA and EatA by immunoblotting
Supernatants of overnight bacterial cultures were precipitated with trichloroacetic acid (TCA) as previously described [23] and detected via Western blot as previously described for EatA [22] or EtpA [30].

Identification of etpA, etpB and eatA by polymerase chain reaction
Isolates encoding eatA, etpA or etpB were identified by PCR as previous described [23,33]. The resulting amplicon was electrophoresed on a 1% gel and visualized. A strain was positive when an amplicon of the appropriate size was visualized.

ETEC isolates
The ETEC isolates were obtained from three hospitals in Chile and were determined to be ETEC based on the presence of at least one of the heat-labile (LT) or heat-sable toxins (STh or STp) at the location of isolation. Of the examined isolates, 37 were obtained from a recent ETEC outbreak in the Antofagasta region of Chile in the cities of Antofagasta and Calama [24] [57]. The remaining 88 isolates are from an ETEC collection maintained in the Vidal Lab that was obtained from an ETEC targeted surveillance study in the community of Santa Julia (Santiago city).

Genome characteristics of isolates
The assembled ETEC genomes on average contained 203 ± 53 contigs. The mean genome size and %GC were 5,036,883 ± 140,015 bp (range 4,708,441-5,479,008 bp) and 50.56 ± 0.13 (range 50.02-50.83), respectively, which are both well within the normal variation of E. coli genome size and %GC. Further analysis was carried out on these assembled genomes, with the exception of four isolates that underwent additional PacBio sequencing to complete their genome assemblies, in which case the complete genomes were used in the analysis (Table 1).

Phylogenomic analysis of Chilean ETEC isolates
The genomes of the Chilean ETEC isolates were compared to selected draft and complete reference ETEC genomes of ETEC from a global collection representing defined abundant ETEC lineages L21L10 [15], a collection of previously sequenced ETEC from Bangladesh [13], and reference E. coli that are commonly used to define the E. coli pathovars and phylogenomic groups [39,50]. The ETEC L1 to L21 lineage reference isolates were selected for comparison based on the dominant MLST sequence types, serotypes and colonization factor profiles from von Mentzer, et. al. [15]. The phylogenomic analysis revealed that 80.8% (101/125) of the Chilean ETEC isolates are in phylogroup A (red highlighting), 15.2% (19/125) in phylogroup B1 (blue highlighting) and 4.0% (5/125) are in phylogroup E (Fig 1). Of the 101 Chilean ETEC isolates in phylogroup A, 27 are in Lineage L6, 40 are in lineage L1/L2 and the remaining 34 are distributed in smaller groups some with lineage references and others as phylogenomic singletons (Fig 1). When further molecular studies of these ETEC genomic lineages are undertaken, it is clear that these lineages represent expansions of successful clones, as all the isolates in Lineage L6 are also ST3223, and 25/27 (92.6%) contain only the STh toxin, with the remaining isolates (2/27) contain both LT and STh (Fig 1 and S1 Table). This relative expansion of the phylogroup A isolates in Chile is not due to the examination of a localized outbreak as there are isolates from each of the three geographic sites in Lineage L6. In other genomic studies coli and Shigella genomes listed in S2 Table using a single nucleotide polymorphism (SNP)-based approach as previously described [46,47]. SNPs were detected relative to the completed genome sequence of the laboratory isolate E. coli IAI39 using the In Silico Genotyper (ISG) [47]. A total of 221,978 conserved SNP sites, which were present in all of the genomes analyzed, were concatenated into a representative sequence for each genome. A maximum-likelihood phylogeny with 100 bootstrap replicates was inferred using RAxML v.7.2.8 [72]. Isolates designated in red are from Chile, isolates designated in blue are the ETEC lineage references identified in von Mentzer et al [15], and isolates designated in black are reference E. coli and Shigella isolates representing other pathotypes and phylogenomic groups. The letters (A, B1, B2, D, E, and F) designate the E. coli and Shigella phylogroups that were previously defined [73,74]. Colored circles indicate the country of origin on the inner ring and the heat labile toxin (LT) and heat stable toxin (ST) status on the middle and outer rings respectively. The green arrows indicate the genomes that were completed using Pacific Biosciences sequencing. The scale bar represents the distance of 0.04 nucleotide substitutions per site.

Complete genome sequencing of Chilean ETEC isolates
In addition to the draft sequences generated for the aforementioned 125 Chilean isolates, four isolates were selected for complete genome sequencing with Pacific Biosciences. Each of these isolates were selected as they are heat stable toxin containing only, which was determined in the GEMS analysis to contribute significantly to severe diarrheal disease in children under five years of age [2,31]. Three of these isolates in phylogroup A, 10754a-1, 10802a and 11573a-1 can be considered clonal (Fig 1, green arrow), as the isolates are most closely related in the inferred phylogeny, whereas the isolate 2407-a is in phylogroup B1. For each isolate examined there is a single chromosome and 2 plasmids that contain many of the canonical ETEC virulence factors for each isolate (  [59], or CFA/ I, STh, eatA and etpA genes respectively. These two plasmids from ETEC isolate 10802-a were examined for the plasmid gene distribution among the Chilean ETEC isolates (Fig 2). These studies identified that there was a limited number of isolates that appeared to contain the majority of the plasmid genes and additional isolates contained components of the plasmids including the virulence factors, but not the complete plasmid (Fig 2). In the fourth isolate, 2407-a, a plasmid of 135,437 bp contains the STh gene, CS5 and CS6 colonization factor gene clusters, as well as eatA and the previously identified plasmid encoded regulator, csvR [60] ( Table 1). An additional plasmid of 9,863 bp was identified in the 2407-a isolate but contains no known virulence genes. These plasmids were less conserved in the examined isolates suggesting that these plasmids are not common. These plasmid findings highlight the variability of the ETEC plasmids as previously described [11,43,61].  Fig 2. Distribution of virulence plasmids among the Chilean ETEC isolates. Heat maps indicate the presence of the virulence plasmids A) p10802a_92 and B) p10802a_46 among the Chilean ETEC and reference ETEC isolates analyzed in this study. The predicted protein-coding genes of each plasmid were between genomic and traditional PCR-based analyses. Among the discordant isolates were 15 toxin-positive strains identified by in silico analysis not identified by the traditional PCR, and 13 toxin-positive isolates identified by PCR not identified in the genome assemblies. Plasmid loss is one reason why there may be an inconsistency with the PCR, but also variation of the toxins or variation of the genes and surrounding genomic regions targeted by the PCR assay.

Detection of toxin genes
In addition to the presence or absence of the ST encoding gene we examined the distribution of the different ST alleles as described in Joffre et al [52]. There was no geographic variation associated with the ST alleles in the examined isolates (S1 Fig). Overall, we do not observe a geographic component to the distribution of any of the ETEC toxin genes.

Detection of colonization factors
In addition to the toxin gene profile, the other factors that identify ETEC isolates are the colonization factors (CF), which have been the target of intense study as vaccine and therapeutic targets. As with the ETEC toxins, there are traditional PCR tests for the majority of the well characterized CFs [21,24,33,37], additionally, the in silico analysis has been expanded by including novel CF gene clusters that have been recently identified but are not yet included in the traditional PCR assay [13]. The CF detection rates are included in Table 3, and for the majority of the CF examined by both genome analysis and PCR-based approaches there is excellent concordance between the two assays. There were no Chilean ETEC isolates that contained CS4, CS13, CS14, CS15, CS18, CS22, CS26, CS27a, CS28a, CS28b, CS31a, CS30 or the novel CF identified from TW11833 [13], nor were the CFs identified that are traditionally associated with livestock (ETEC_K88_ab, ETEC_K88ac, ETEC_K99, ETEC_f41a, ETEC_f987P, ETEC_f17a). The distribution of the CFs from the Chilean isolates was not significantly different when compared to the global distributions of the CFs in others studies [13,15,31,62]. identified using BLASTN LS-BSR [48] as previously described [43]. Each row represents an individual genome that is labeled on the left side by its ETEC toxin content as having the heat labile toxin (LT), heat stable toxin (ST), both LT and ST, or neither LT nor ST. Each column represents a different proteincoding gene of the reference plasmid being compared. Virulence factors of interest are indicated by a red box. A dashed line box indicates a group of genomes that contain all or nearly all of the plasmid genes. https://doi.org/10.1371/journal.pntd.0007828.g002

Detection of non-canonical secreted antigens
In addition to whole genome sequencing the isolates were also interrogated by gene specific PCR and Western blotting for the putative vaccine candidates, EtpA [68] and EatA [25]. By PCR, the eatA gene was identified in 76% of isolates consistent with the genome data, which identified the gene in 68% of the isolates, and immunoblotting confirming EatA protein secretion in 72% of the isolates tested. The etpB, and etpA genes were identified in 71.9%, and 82% of isolates, respectively, with EtpA protein secretion verified in 75% of isolates tested (S1 Table). However, etpA was identified in the genomes of only two isolates, likely relating to difficulty in assembling the multiple repeat modules comprising the 5' end of the gene using short read technologies [13].

Functional characteristics of isolates
Traditional serotyping of ETEC has demonstrated that this pathovar is associated with many serotypes [69]. In the current study we have the opportunity to compare serotyping by traditional methods with an in silico serotyping method (https://cge.cbs.dtu.dk/services/ SerotypeFinder/) (S3 Table). Of the isolates that were examined by both serotyping methodologies (n = 89), the in silico and traditional methods were generally congruent for 66.3% (59/ 89). The majority of the inconsistencies were common in the trend of discordance, in that there were common mis-identifications between the two methodologies. There were 13 isolates (14.6%) identified as O114/O127 by traditional methods, but were predicted to be O128ab/ac by in silico serotyping, an additional eight isolates that were identified as O153 by traditional methods, but were predicted to be non-typeable by in silico methods, and nine other isolates with unique predicted in silico serotypes with discordant serotyping results. Of the identified serotypes, the O6 was the most prevalent O serotype (n = 39, 31.2%), H16 (n = 39, 31.2%) was the most prevalent H serotype, and the O6:H16 (n = 25, 20%) was the most common combination (S3 Table). The distributions of the O and H antigens identified in the current study were similar to that of a global examination of ETEC isolates from the literature that had identified ETEC isolates of the O6:H16 serotypes as among the most prevalent [69].

Comparison of total genome content
The total gene content of the Chilean ETEC isolates were compared using de novo LS-BSR. We identified 18,719 gene clusters, of which 3,806 were identified in all 125 isolates, comprising the core genome. This number of predicted genes was similar to previous estimations of the genome core of E. coli [12,15,70]. The LS-BSR analysis allows us to integrate the clinical parameters, location of isolation, as well as genomic factors (MLST or phylogroup) to identify genomic features that may be associated with these clinical parameters. There are no genes that are exclusive to isolates from one of the Chilean geographic locations when compared to the isolates from the other locations in Chile; however there were genes that were identified among a greater number of the ETEC isolates from certain geographic locations compared to the other geographic locations in Chile (S4 Table). We could identify 496, 456, or 94 gene clusters that were more or less prevalent among the isolates from Santiago, Antofagasta, or Calama, respectively (chi-squared test, p-value <0.0001) (S4 Table). Additionally, the examination of the Chilean ETEC isolates, as in previous studies [13][14][15]58], did not identify any genes that are exclusively detected in the genomes of phylogroups A or B1; however phylogroup E had 30 genes that were identified among all members of that phylogroup and absent in all other Chilean ETEC genomes of the other phylogroups (S5 Table). Additionally, there were 1190, 978, and 896 genes that were detected in in a greater proportion among the genomes of phylogroups A, B1 or E; respectively, compared to isolates in the other phylogroups (chi-squared test, p-value <0.0001, S5 Table). These genes represent multiple biological functions, including secretion systems, as well as potential virulence factors.

Discussion
This study describes the genomic content of ETEC isolates from three geographic locations within Chile. While other molecular studies have focused on the virulence and/or colonization factor distribution among the ETEC isolates in Chile [33, 36, 37], this is the first study to incorporate large-scale comparative genomics analyses. The distribution of the canonical virulence factors, including the CF types was not significantly different when compared to previous global studies [13,15,31,62]. However, the most striking finding from the study is that the majority of the Chilean ETEC isolates are within phylogroup A, and within that phylogroup multiple ETEC lineages could be readily identified that have similar toxin, MLST and phenotypic profiles, suggesting that these clones have been successful and have expanded within Chile (Fig 1). In each of the sub-lineages within phylogroup A there is representation from each of the three geographic locations, suggesting that one geographic location or clinical presentation is not driving the observed difference. Additionally, we were able to compare the Chilean isolates to reference ETEC lineages identified by von Mentzer et al. [15], and in each case where Chilean isolates formed a group with the lineage reference isolate(s) they shared similar molecular attributes such as toxin profile, colonization factor or MLST sequence type, suggesting that these are successful global ETEC lineages, which have expanded in Chile. For example, ETEC lineage L6 was identified by von Mentzer et al. [15], which contained only seven L6 isolates from a collection of 462 isolates (7/462, 1.5%), none of which were from Chile or South America, whereas in the current collection of 125 Chilean ETEC isolates, 27 isolates were identified in Lineage L6 (27/125, 21.6%). This represents a 14.4 fold greater detection of the ETEC L6 isolates in Chile compared to the rest of the world (Chi squared pvalue < 2.2e-16). The reasons behind the increased occurrence of this particular ETEC lineage in Chile is not clear and the current study focuses only on the pathogen attributes; however there are many factors to consider when thinking about the host pathogen interaction, including the host genome, as well as nutritional status and the microbiota of the individual, all of which may play a role in the clinical outcome. The representative ETEC isolates selected for genome completion by PacBio sequencing in the current study were from the major ETEC phylogroups (B1 and A), as well as containing the ST toxin genes, and they complement the complete genomes of the ETEC prototypes [11,12] and recent isolates [16,17,31,43]. Interestingly, the three phylogroup A isolates selected for complete genome sequencing were very closely related based on the phylogeny (Fig 1), and each isolate has the same plasmid profile, and a very similar virulence factor profile ( Table 1). While the early studies of ETEC plasmid diversity indicated that the plasmid content was extremely variable [13,14,60,61,67,71], more studies [31], including the current study indicate the presence of stable combinations of ETEC chromosome and plasmids. The reasons for the stability of these combinations and instability of other chromosome:plasmid combinations is not clear and will require further functional analyses to examine in detail.
Overall, this study demonstrates the utility of combining whole genome sequencing with the isolate and laboratory molecular data to provide a detailed view of the pathogen distribution within a country or geographic region. We have determined that the Chilean ETEC isolates primarily are present in phylogroup A at an almost four fold greater proportion when compared to a global study of ETEC [15], whereas ETEC from other geographic locations are more distributed between phylogroups A and B1. How this impacts clinical outcomes or ETEC transmission is not yet clear, but provides a foundation to build upon for future studies to examine host genomics or other environmental factors that may be important in the study of this human pathogen.

S1 Fig. Phylogenetic analysis of STh and STp.
Nucleotide sequences of STh and STp genes were aligned with the estA1-estA7 reference sequences and used to infer a maximum likelihood phylogeny as previously described [43]. Labels of the Chilean ETEC isolates are colored by their location (see inset legend), while labels of reference ETEC are indicated in black. The estA1-estA7 reference sequences [52] are indicated in bold. The tree scale indicates the distance of 0.04 nucleotide substitutions per site. Bootstrap values �50 are indicated by gray circles. (EPS) S1