Major role of the high-pathogenicity island (HPI) in the intrinsic extra-intestinal virulence of Escherichia coli revealed by a genome-wide association study

The bacterium Escherichia coli is not only an important gut commensal, but also a common pathogen involved in both diarrheic and extra-intestinal diseases. To characterize the genetic determinants of extra-intestinal virulence we carried out an unbiased genome-wide association study (GWAS) on 234 commensal and extra-intestinal pathogenic strains representative of the species phylogenetic diversity, tested in a mouse model of sepsis. We found that the high-pathogenicity island (HPI), a ~35 kbp gene island encoding the yersiniabactin siderophore, is highly associated with death in mice, surpassing all other genetic factors by far. We validated the association in vivo by deleting key components of the HPI in strains in two phylogenetic backgrounds, and found that virulence is correlated with growth in the presence of various compounds including several antimicrobials, which hints at collateral sensitivities associated with intrinsic pathogenicity. This study points at the power of unbiased genetic approaches to uncover virulence determinants and the use of phenotypic data to generate new hypothesis on pathogenicity and phenotypic characteristics associated with it.

The bacterium Escherichia coli is not only an important gut commensal, but also a common pathogen involved in both diarrheic and extra-intestinal diseases. To characterize the genetic determinants of extra-intestinal virulence we carried out an unbiased genome-wide association study (GWAS) on 234 commensal and extra-intestinal pathogenic strains representative of the species phylogenetic diversity, tested in a mouse model of sepsis. We found that the high-pathogenicity island (HPI), a ~35 kbp gene island encoding the yersiniabactin siderophore, is highly associated with death in mice, surpassing all other genetic factors by far. We validated the association in vivo by deleting key components of the HPI in strains in two phylogenetic backgrounds, and found that virulence is correlated with growth in the presence of various compounds including several antimicrobials, which hints at collateral sensitivities associated with intrinsic pathogenicity. This study points at the power of unbiased genetic approaches to uncover virulence determinants and the use of phenotypic data to generate new hypothesis on pathogenicity and phenotypic characteristics associated with it.

Introduction
Escherichia coli is both a commensal of vertebrates 1 and an opportunistic pathogen 2 involved in a wide range of intestinal and extra-intestinal infections. Extra-intestinal infections in humans represent a considerable burden 3 , bloodstream infections (bacteraemia) being the most severe with a high attributable mortality of between 10-30% 4-6 . The regular increase over the last 20 years of E. coli bloodstream incidence 7 and antibiotic resistance 8 is particularly worrisome. The factors associated with high mortality are mainly linked to host conditions such as age, the presence of underlying diseases and to the portal of entry, with the urinary origin being more protective. These factors outweighing those directly attributable to the bacterial agent 4-6,9 . Nevertheless, the use of animal models has shown a great variability in the intrinsic extra-intestinal virulence potential of natural E. coli isolates. In a mouse model of sepsis where bacteria are inoculated subcutaneously, it has been clearly shown that the intrinsic virulence quantified by the number of animal deaths over the number of inoculated animals for a given strain is dependant on the number of virulence factors such as adhesins, toxins, protectins and iron capture systems [10][11][12][13] . One of the most relevant virulence factors is the so-called high-pathogenicity island (HPI), a 36 to 43 kb region encoding the siderophore yersiniabactin, a major bacterial iron uptake system 14 .
The deletion of the HPI results in a decrease in the intrinsic virulence in the mouse model but in a strain-dependent manner [13][14][15][16] , indicating complex interactions between the genetic background of the strains and the HPI.
The limitation of these gene KO studies is that they target specific candidate genes. Recently, the development of new approaches in bacterial genome-wide association studies (GWAS) [17][18][19][20] allows searching in an unbiased manner for genotypes associated to specific phenotypes such as drug resistance or virulence. In this context, we conducted a GWAS in 234 commensal and extra-intestinal pathogenic strains of E. coli , representing the species phylogenetic diversity, to search for traits associated to virulence in the mouse model of sepsis 21 . The strains belong to three large strain collections that span the species' major phylogroup diversity; the ECOR 22 , IAI 10 and NILS 23 collections. All three collections contain commensal as well as extra-intestinal pathogenic E. coli (ExPEC), being defined as strains that possessed currently recognized extra-intestinal virulence factors and/or demonstrated enhanced virulence in an appropriate animal model of extra-intestinal infection 24 . Most importantly, strains from these collections have been recently sequenced and phenotyped across hundreds of growth conditions, including antibiotics and other chemical and physical stressors 25 . This data could then be used to find phenotype associations with virulence and to generate hypotheses on the function of genetic variants associated with the ExPEC phenotype and potential collateral sensitivities associated with them.

GWAS identifies the high-pathogenicity island as the strongest driver of the extra-intestinal virulence phenotype
We studied three strain collections representative of the E. coli sensu lato phylogenetic diversity, i.e., Escherichia clade I in addition to phylogroup A, B1, C, D, E and F strains 26 . These strains encompass 90 commensal strains and 144 strains isolated in various extra-intestinal infections, mainly urinary tract infections and septicaemia 10,22,23 . To avoid any bias linked to host conditions, we assessed the strain virulence as its intrinsic extra-intestinal pathogenic potential using a well-calibrated mouse model of sepsis 10,21 , expressed as the number of killed mice over the 10 inoculated per strain. In accordance with previous data, phylogroups B2, D and F have a higher proportion of virulent strains, as compared to phylogroups A and B1 ( Figure 1A, Supplementary Table  1).
We used a bacterial GWAS method to associate k -mers to the virulence phenotype, allowing us to simultaneously test the contribution of core and accessory genome variation to pathogenicity 19 . It is generally understood that such methods require large sample sizes to have sufficient power, partly due to the need to break the long clonal frames typical of bacterial genomes; the appropriate sample size is also a function of the penetrance of the associated variants 18,27 . We ran simulations with an unrelated set of complete E. coli genomes and verified that our sample size was appropriate for variants with high penetrance (i.e. odds ratio above 5, Supplementary Figure 1, Methods). We reasoned that the genetic determinants of virulence are likely to have a relatively high penetrance, and that the strains used were genetically diverse, enough to break the clonal frame.
We uncovered a statistically significant association between 47,598 k -mers and the virulence phenotype, which were mapped back to 86 genes across the strains' pangenome ( Figure 1B, Methods). A separate association using genes' presence absence patterns showed that the genes to which the associated k -mers mapped to have an odds ratio that far exceeds the required threshold we estimated from simulations ( Figure 1C). Since the average minimum allele frequency (MAF) of associated k-mers is consistently around 36% ( Figure 1B) and the distance between the genes they map to across all strains is around 1 kbp ( Figure 1D), we concluded that the virulence phenotype is associated to the presence of a gene island. In fact, all genes belonging to the HPI had the vast majority of associated k -mers mapped to them (normalized hits >= 0.1, Figure 1E). Moreover, we found that the HPI structure was highly conserved across the 151 genomes that encode it (Supplementary Figure 2). We also observed that the distribution of known virulence factors doesn't match the virulence phenotype as closely as the HPI or has k -mers passing the association threshold, further reinforcing the association results that the HPI is one of the main genetic factors behind virulence across phylogroups (Supplementary Figure 3).

Figure 1. The HPI is strongly associated with the extra-intestinal virulence phenotype assessed in the mouse sepsis assay.
A) Core genome phylogenetic tree of the E. coli strains used in this study rooted on Escherichia clade I strains. Outer ring reports virulence as the number of killed mice over the 10 inoculated per strain, inner ring the phylogroup each strain belongs to. B) Results of the k-mer association analysis: for each gene the minimum association p-value and average minimum allele frequency (MAF) across all mapped k-mers are reported. The normalized hits are computed by dividing the number of mapped kmers by the length of the gene. C) Results of the gene association analysis; each gene tested is represented. Genes from the k-mer association analysis are highlighted in red. D) The associated genes (normalized hits >= 0.1) belong to a gene cassette. OGs: orthologous groups. E) The HPI gene cassette structure in strain IAI39; all associated genes are highlighted.

KO gene experiments validate the role of the HPI in the extra-intestinal phenotype
The studies on the role of the HPI in experimental virulence gave contrasting results according to the strains' genetic background 13 . Among B2 phylogroup strains, HPI deletion in the 536 (ST127) strain did not have any effect in the mouse model of sepsis 28 whereas this deletion in the NU14 (ST95) strain dramatically attenuated virulence 13 . Two strains from this study belonging to B2 phylogroup/ST141 (IAI51 and IAI52) deleted in irp1 have attenuated virulence in the same model 15 . To have a broader view of the role of the HPI in various genetic backgrounds, we constructed irp2 deletion gene mutants in two strains of phylogroup D (NILS46) and A (NILS9) belonging to STs frequently involved in human bacteraemia (ST69 and ST10, respectively) 29 . We first verified that the wild-type strains strongly produced yersiniabactin, whereas the irp2 mutants did not ( Figure 2A). We then tested them in the mouse sepsis model and saw an increase in survival for both mutated strains (log-rank test p-value < 0.0001 and 0.0217 for strain NILS9 and NILS46, respectively, Figure 2B, Supplementary Table 2) with no significant difference between the survival profiles for the two mutants (log-rank test p-value > 0.1). We have therefore validated in vivo the causal link between the HPI and the virulence phenotype detected by the means of an unbiased association approach, which demonstrates the power and accuracy of bacterial GWAS.

High-throughput phenotypic data sheds light on HPI's function
The main function encoded by the HPI cassette is iron scavenging through the expression of the siderophore yersiniabactin 15 , which has been previously validated in E. coli through knockout experiments 13 . In order to investigate other putative functions, we leveraged a previously-generated high-throughput phenotypic screening in an E. coli strain panel that largely overlaps with the strains used here (186 over 234) 25 . We observed a relatively high correlation between growth profiles in certain conditions and both virulence and presence of the HPI cassette ( Figure 3A and 3B, Supplementary Table 3); given the strong association between the two, we observed similar conditions being correlated.
As expected, we found a correlation between growth on the iron-sequestering agent pentetic acid 30 and both HPI presence and virulence (Pearson's r: 0.47 and 0.36, respectively). We similarly, observed a correlation between pyocyanin, a redox-active phenazine compound being able to reduce Fe 3+ to Fe 2+ 31 , and both HPI presence and virulence (Pearson's r: 0.36 and 0.29, respectively) Interestingly, we found similarly strong correlations with growth on sub-inhibitory concentrations of several antibiotics, such as rifampicin, ciprofloxacin, amoxicillin and oxacillin, as well as other antimicrobial agents such as cerulenin and colicin. This might be due to the presence of resistance alleles and/or genes that are strongly associated with pathogenic strains, or might point to the role of iron homeostasis in intrinsic resistance to antibiotics 32 . As an example, quinolones bind Fe 3+ on its pyridione ring, which is also involved in the interaction with its target, DNA gyrase 33 . Cell envelope permeability can also be modified in response to the presence of iron via two-component systems, rendering the cell more resistant 32 . On the other hand we found that growth in presence of indole at 2 mM in association with sub-inhibitory concentrations of cefsulodin and tobramycin antibiotics, but not with each of these compounds alone, was negatively correlated with both HPI presence and virulence. This might indicate a synergy between antibiotics and indole. In lysogeny broth, sub lethal concentrations of antibiotics increased the endogenous production of indole by the cells 34 and, at very high concentration (5 mM), indole induces the production of reactive oxygen species and is toxic for the cells 35 . This toxicity has been shown to be partly iron mediated due to the Fenton reaction 36 , explaining that cells with increased import of extracellular iron due to the HPI might be more sensitive to these compounds. These associations suggest the potential for collateral sensitivities related to both intrinsic pathogenicity and the presence of the HPI.
Given the relatively large number of conditions correlated with both pathogenicity and HPI presence, we tested whether both features could be predicted from growth data. We used the commonly-used random forests machine learning algorithm with appropriate partitioning of input data to tune hyperparameters and reduce overfitting, leading to two classifiers for virulence and presence of the HPI cassette with high predictive power ( Figure 3C and 3D and Methods). We noted that prediction of HPI presence performs slightly better than virulence, possibly reflecting the complex nature of the latter phenotype. As expected, we found that conditions with relatively high correlation with both features have a higher weight in both classifiers ( Figure 3E, Supplementary Table 4), which suggests that a subset of phenotypic tests might be sufficient to classify pathogenic strains. These results show how phenotypic data can be used to generate hypotheses over gene function and pathogenesis.

Discussion
With the steady decline in the price of genomic sequencing and the increasing availability of molecular and phenotypic data for bacterial isolates, it has finally become possible to use statistical genomics approaches such as GWAS to uncover the genetic determinants of relevant phenotypes. Such approaches have the advantage of being unbiased, and can then be used to confirm previous targeted findings and potentially uncover new factors, given sufficient statistical power. The accumulation of other molecular and phenotypic data can on the other hand uncover variables correlated with phenotype, which can be used to generate testable hypothesis on the function of genomic hits and potential collateral sensitivities associated with them. Given the rise of both E. coli extra-intestinal infections and antimicrobial resistance, we reasoned that the intrinsic virulence assessed in a calibrated mouse model of sepsis 10,21 is a phenotype worth exploring with such an unbiased approach.
We were able to confirm earlier reports on the importance of the HPI in extra-intestinal virulence [13][14][15][16]37 , which showed the strongest signal in both the k -mer and accessory genome association analysis, and whose importance was validated in vitro and in an in vivo model. The distribution of the HPI within the species resulting from multiple horizontal gene transfers via homologous recombination 38 has probably facilitated its identification using GWAS. Additional genetic factors might have been overlooked by this analysis, due to the relatively small sample size; we however estimate that those putative additional factors might have a relatively low penetrance, based on our simulations in an independent dataset. As sequencing of bacterial isolates is becoming more common in clinical settings 39-41 , we expect to be able to uncover these additional genetic factors in future studies.
The association between both the intrinsic virulence phenotype and the presence of the HPI and previously collected growth data allowed us to generate testable hypotheses on mechanism of pathogenesis and putative additional functions of the HPI. In particular we observed a relatively strong correlation between growth on various antimicrobial agents and both pathogenicity and HPI presence, which confirms the pressure to acquire resistance for these isolates, but also on the potential role of HPI and iron homeostasis on antimicrobial resistance 32 . E. coli mutants of fur , a transcriptional regulator that represses iron uptake systems, which accumulate high level of intracellular iron, have been shown to increase resistance to quinolones, aminoglycoside, tetracycline, rifampicin and amoxicillin 42 . The negative correlation with growth profiles in the presence of the indole associated to antibiotics points to the possible deleterious role of iron in the effect of sublethal doses of antibiotics. A vicious circle is rapidly established as antibiotics increase the production of indole 34 , which in turn destabilise the membrane 35 , further increasing the penetration of the antibiotics. The deletion of TonB, an iron transporter, increase resistance to the antibiotic, showing the role of reactive oxygen species generated by the Fenton reaction in the presence of iron 36 . Altogether, these data bring new light on the "liaisons dangereuses" between iron and antibiotics that could potentially be targeted 32 .
We also demonstrate how growth data across several conditions can accurately distinguish pathogenic from non-pathogenic isolates, which could lead to the development of growth-based tests, which could complement and validate existing diagnostic tools based on molecular and phenotypic data [43][44][45] . Taken together this analysis demonstrates how a data-centric approach can increase our knowledge of complex bacterial phenotypes and guide further empirical work on gene function and its relationship to intrinsic pathogenicity.

Strains used
The full list of the 234 strains used in the association analysis, together with their main characteristics is reported in Supplementary Table 1. The genome sequences of all 234 strains is available through Figshare 46 .
The construction of the irp2 deletion mutants of the NILS9 and NILS46 strains was achieved following a strategy adapted from Datsenko and Wanner 47 . Primers used in the study are listed in Supplementary Table 5. In brief, primers used for gene disruption included 44-46 nucleotide homology extensions to the 5'-and 3' regions of the target gene, respectively, and additional 20 nucleotides of priming sequence for amplification of the resistance cassette on the template plasmids pKD4. The PCR product was then transformed into strains carrying the helper plasmid pKOBEG expressing the lambda red recombinase under control of an arabinose-inducible promoter 48 . Kanamycin resistant transformants were selected and further screened for correct integration of the resistance marker by PCR. For elimination of the antibiotic resistance gene, helper plasmid pCP20 was used according to the published protocol. PCR followed by Sanger sequencing of the mutants were performed to verify the deletion and the presence of the expected scar.

Yersiniabactin detection assay
Production of the siderophore yersiniabactin was detected and quantified using a luciferase reporter assay as described elsewhere 13,49 . Briefly, bacterial strains were cultivated in NBD medium for 24 hours at 37°C. Next, bacteria were pelleted by centrifugation and the supernatant was added to the indicator strain WR 1542 harbouring plasmid pACYC5.3L. All the genes necessary for yersiniabactin uptake are located on the plasmid pACYC5.3L, i.e. irp6, irp7, irp8, fyuA, ybtA . Furthermore, this plasmid is equipped with a fusion of the fyuA promoter region with the luciferase reporter gene. The amount of yersiniabactin can be quantified semi-quantitatively, as yersiniabactin-dependant upregulation of fyuA expression is determined by luciferase activity of the fyuA-luc reporter fusion. colony forming unit). Time to death was recorded during the following 7 days. Mice surviving more than 7 days were considered cured and sacrificed 10 . In each experiment, the E. coli CFT073 strain was used as a positive control killing all the inoculated mice whereas the E. coli K-12 MG1655 strain was used as a negative control for which all the inoculated mice survive 21 . For the mutant assays, 20 mice per strain were used to obtain statistical relevant data. The data was analysed using the lifeline package v0.21.0 50 .

Association analysis
All genome-wide association analysis were carried out using pyseer, version v1.2.0 19 . All input genomes were re-annotated using prokka, version v1.13.3 51 , to ensure uniform gene calls and excluding contigs whose size was above 200 base pairs. The core genome phylogenetic tree was generated using ParSNP 52 to generate the core genome alignment and gubbins v2.3.1 53 to generate the phylogenetic tree. The strain's pangenome was estimated using roary v3.12.0 54 . K-mers distributions from the input genome assemblies were computed using fsm-lite 18 , with a minimum and maximum k value of 9 and 100, respectively. The association between both k-mers and pangenome and phenotype (expressed as number of mice killed post-infection) was carried out using the FastLMM 55 linear mixed-model implemented in pyseer, using a kinship matrix derived from the phylogenetic tree as population structure. For both association analysis we used the number of unique presence/absence patterns to derive an appropriate p-value threshold for the association likelihood ratio test (2.90E -09 and 7.03E -06 for the k -mers and pangenome analysis, respectively). K -mers significantly associated with the phenotype were mapped back to each input genome using bwa mem v0.7.17-r1188 56 and betools v.2.27.1 57 , using the pangenome analysis to collapse gene hits to individual groups of orthologs. A sample protein sequence for each groups of orthologs where at least on k -mer with size 20 or higher was mapped was extracted giving priority to strain IAI39 when available, given it was the only strain with a complete genome available; those sample sequences where used to search for homologs in the uniref50 database from uniprot 58 using blast v2.7.1+ 59 . Each group of orthologs was then given a gene name using both available literature information and the results of the homology search. Distances between each pair of associated groups of orthologs was computed using the annotation files, using an equal number of random pairs as background.

Power simulations
Statistical power was estimated using an unrelated set of 548 complete E. coli genomes downloaded from NCBI RefSeq using ncbi-genome-download v0.2.9 on May 24th 2018. Each genome was subject to the same processing as the actual ones used in the real analysis (re-annotation, phylogenetic tree construction, pangenome estimation). The gene presence/absence patterns were used to run the simulations, in a similar way as described in the original SEER implementation 18 . Briefly, for each sample size, a random subset of strains was selected, and the likelihood ratio test p-value threshold was estimated by counting the number of unique gene presence/absence patterns in the reduced roary matrix. For each odds ratio tested, a binary case-control phenotype vector was constructed for the strains subset using the following formulae: Were is the ratio of case/controls (set at 1 in these simulations), as the S r AF M minimum allele frequency of the target gene in the strains subset, and the number of D e cases. pyseer's LMM model was then applied to the presence/absence vector of the target gene and the likelihood ratio test p-value was compared with the empirical threshold. The randomization was repeated 100 times and power was defined as the proportion of randomizations for each sample size and odds ratio whose p-value was below the threshold. The pks2 and fabG genes were used as gene targets in the simulations, and both gave very similar results.

Correlations with growth profiles
The previously generated phenotypic data 25 for 186 over 234 strains were used to compute correlations with both the number of mice killed after infection and presence/absence of the HPI. The data was downloaded from the ecoref website ( https://evocellnet.github.io/ecoref/download/ ) and the pearson correlation with the s-scores was computed together with the correlation p-value. Two predictors, one for virulence (number of killed mice post-infection) and one for presence of the HPI were built using the random forest classifier algorithm implemented in scikit-learn v.020.2 60 , using the s-scores as predictors. The input was column imputed, and 33% of the observations were kept as a test dataset, using a "stratified shuffle split" strategy. The remainder was used to train the classifier, using a grid search to select the number of trees and the maximum number of features used, through 10 rounds of stratified shuffle split with validation set size of 33% the training set and using the F1 measure as score. The performance of the classifiers on the test set were assessed by computing the area under the receiver operating characteristic curve (ROC-curve).  366  367  368  369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384  385  386  387  388 All input data and code used to run the analysis and generate the plots is available online at https://github.com/mgalardini/2018_ecoli_pathogenicity . Code is mostly based on the Python programming language and the following libraries: numpy v1. 16 Figure 3. Presence/absence patterns of known virulence factors other than genes belonging to the HPI. Blue indicates presence, light grey indicates absence. Phenotypes (number of killed mice) and phylogroup of each strain are reported as in Figure 1A.