Fig 1.
Expanding and Analyzing the E. coli panGEM.
(A) Expansion of the E. coli panGEM. Building on our previous work with 2,377 fully sequenced E. coli strains, we generated 12,934 new genome-scale metabolic models (GEMs) from E. coli genomes available as contigs, expanding the panGEM to a total of 15,311 strain-specific GEMs (2,377 previously reconstructed + 12,934 new). These GEMs underwent semi-automated curation and gap-filling to incorporate missing gene-to-protein-to-reaction (GPR) associations. (B) Nutrient availability in colonization sites. Using metabolomics data from the Human Metabolome Database (HMDB) [32], we simulated nutrient availability in three key colonization sites where E. coli is known to grow and cause infections. Flux variability analysis (FVA) was performed to assess the utilization potential of available nutrients in each medium. (C) Metabolic shifts across colonization sites. Parsimonious flux balance analysis (pFBA) was applied to predict metabolic shifts in uropathogenic E. coli (UPEC) across the simulated colonization sites, highlighting differential metabolic behaviors. (D) Pangenome-scale knock-out simulations across colonization sites. Single-reaction deletion analysis was conducted across the pangenome, generating over 22 million predictions to identify essential genes.
Fig 2.
Computed metabolic states across strains in the panGEM for growth in feces, urine, and serum.
(A) Overlap of consumable nutrients in feces, urine, and serum. The UpSet plot details the count of unique nutrients in each simulated condition, as well as the shared count of nutrients between pairs of conditions and among all three conditions. (B) Clustering of GEMs based on minimum reaction fluxes from FVA. The t-SNE plot illustrates the clustering of GEMs according to the minimum fluxes of their reactions during simulated growth in feces, urine, and serum. Clusters are highlighted with distinct colors corresponding to each media type (feces, serum, urine). The inset highlights UPEC-enriched and UPEC-free clusters. (C) Metabolic shifts of a pathogenic UPEC strain during growth in urine, feces, and serum. The metabolic maps illustrate the changes in fluxes through the TCA cycle, glycolysis, pentose phosphate pathway, ATP synthase, and the electron transport chain, as calculated by pFBA for growth in feces, urine, and serum. The fluxes of different reactions are color-coded according to the legend. (D–I) Effect-size–aware volcano plots of reaction-level flux differences between UPEC and non-UPEC isolates across three media. Subpanels D–F show FVA center (mean of min/max flux) for urine, serum, and feces, respectively; G–I show FVA span (max − min) for the same media. Each point is a reaction (x-axis: Hedges’ g for UPEC − non-UPEC; y-axis: −log₁₀(FDR) from a two-sided Mann–Whitney U test with Benjamini–Hochberg correction). Positive g indicates higher flux in UPEC; negative g indicates higher flux in non-UPEC. Grey points are non-interpretable; colored points mark the top interpretable hits within each subpanel meeting all criteria: FDR ≤ 0.05, |g| ≥ 0.5, and |Cliff’s δ| ≥ 0.33. Legends are per subpanel and list the highlighted reactions.
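The quantities plotted in panels D–I can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, Hedges' g uses the common approximate small-sample correction, and the raw p-values are assumed to come from a two-sided Mann–Whitney U test computed elsewhere.

```python
from math import sqrt
from statistics import mean, variance


def fva_center_span(fmin, fmax):
    """FVA center (mean of min/max flux) and span (max - min) for one reaction."""
    return (fmin + fmax) / 2, fmax - fmin


def hedges_g(a, b):
    """Hedges' g for a - b: Cohen's d times an approximate small-sample correction."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    d = (mean(a) - mean(b)) / pooled_sd
    return d * (1 - 3 / (4 * (na + nb) - 9))


def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(x > y for x in a for y in b)
    smaller = sum(x < y for x in a for y in b)
    return (greater - smaller) / (len(a) * len(b))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def is_highlighted(fdr, g, delta):
    """Thresholds stated in the caption: FDR <= 0.05, |g| >= 0.5, |delta| >= 0.33."""
    return fdr <= 0.05 and abs(g) >= 0.5 and abs(delta) >= 0.33
```

In this sketch, group `a` would hold the per-strain FVA center (or span) of one reaction in UPEC isolates and group `b` the same quantity in non-UPEC isolates, so a positive g corresponds to higher flux in UPEC.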
Fig 3.
Anabolic-Catabolic Balance of the NADPH Pool in UPEC Metabolism Across Urine, Feces, and Serum.
(A) Biomass–medium overlap. Binary heatmap showing the presence of biomass objective metabolites (columns) in each sample medium (rows: Serum, Urine, Feces). Columns are ordered by decreasing frequency of presence across samples to emphasize commonly shared metabolites. The right annotation column reports the total number of biomass metabolites present in each sample. (B) Predicted NADPH-producing and NADPH-consuming reactions for UPEC E. coli JJ1887 (ST131) in three media representing feces, serum, and urine. Values are parsimonious FBA fluxes (mmol gDCW⁻¹ h⁻¹). The “Direction” column indicates the physiological direction used for reporting: NADPH → NADP⁺ = NADPH consumption, NADP⁺ → NADPH = NADPH production. Only reactions with non-zero flux in at least one environment are shown; blue cells denote zero flux in that environment. (C) Conceptual summary of the environment-specific routing of reducing power. Reactions highlighted correspond to flux-carrying NADPH nodes from panel B, illustrating shifts in anabolic redox demand and catabolic by-products (e.g., formate) across feces, serum, and urine. Abbreviations: UPEC, uropathogenic E. coli; mmol gDCW⁻¹ h⁻¹, millimoles per gram dry cell weight per hour.
Fig 4.
Validation of gene knock-out predictions.
Knock-out predictions were evaluated against published M9 knock-out data for 850 metabolic genes across 12 E. coli strains [15–17,20,27,39]. Eleven GEMs were reconstructed from draft (contig) genomes, whereas the MG1655 GEM derives from a complete genome and shows the lowest error rate (accuracy = 93.81%, precision = 93.85%). (A) Phylogroup composition of the 12 strains used for KO validation. (B) Confusion matrix comparing model predictions (growth/no growth) with KO outcomes; cells report counts and proportions of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). (C) Aggregate validation metrics (accuracy, precision, F1, and Matthews correlation coefficient). (D) Detailed comparison of experimental KO outcomes versus model predictions across all tests, shown as a matrix with genes as columns and strains as rows; gene names are annotated. (E) Zoomed view of rare-essential genes across genomes, grouped by the reaction they encode, showing predicted versus experimental outcomes. White cells indicate that the gene is not rare in that genome; colors denote TN (blue), TP (green), FN (orange), and FP (red). The panel footer summarizes counts and derived metrics, and the fraction of strains carrying the rare-essential gene.
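For reference, the aggregate metrics in panel C follow from the four confusion-matrix counts in panel B by their standard definitions. The Python sketch below is illustrative only; the function name is hypothetical, and which outcome counts as the positive class is an assumption not stated in the caption.

```python
from math import sqrt


def validation_metrics(tp, tn, fp, fn):
    """Accuracy, precision, F1, and Matthews correlation coefficient (MCC).

    Assumes one outcome (for example, predicted growth) is the positive class.
    Undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "f1": f1, "mcc": mcc}


# Toy counts for illustration only; these are not values from the figure.
metrics = validation_metrics(tp=90, tn=5, fp=3, fn=2)
```

Unlike accuracy, MCC uses all four cells of the matrix, so it stays informative when growth and no-growth outcomes are strongly imbalanced.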
Fig 5.
Experimental verification of panGEM predictions for carbon-source utilization.
(A) Distribution of the 67 E. coli strains across phylogroups used for validation. (B) Intra-phylogroup diversity of the validation set. Annotations indicate phylogroup, Mash cluster (distance threshold 0.02), genome count, and the fraction of genomes (%) assigned to each cluster. (C) Heat map comparing model predictions with BioLog growth/no-growth outcomes across 96 carbon sources for genome-scale models (GEMs) of 67 strains (59 assessed previously and 8 newly profiled here). A version annotated with compound names is provided in S3 Fig. (D) Confusion matrix for the previously assessed strains. (E) Confusion matrix for the strains validated in this study. (F) Confusion matrix for the combined set (previous + current). Panels D–F report counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) for model predictions relative to experimental outcomes. (G) Weighted performance statistics (precision, recall, and F1) for the previously assessed, newly assessed, and combined sets. Color scheme used throughout: purple, TN; orange, FN; yellow, FP; green, TP.
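One common reading of the "weighted" statistics in panel G is a support-weighted average of per-class precision, recall, and F1 over the growth and no-growth classes. That interpretation is an assumption, not something the caption confirms, and the function names in the Python sketch below are hypothetical.

```python
def class_scores(tp, fp, fn):
    """Precision, recall, and F1 for one class from its own TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1


def weighted_scores(tp, tn, fp, fn):
    """Support-weighted precision, recall, and F1 over both classes.

    Growth class:    TP = tp, FP = fp, FN = fn, support = tp + fn.
    No-growth class: TP = tn, FP = fn, FN = fp, support = tn + fp.
    """
    growth = class_scores(tp, fp, fn)
    no_growth = class_scores(tn, fn, fp)
    support_growth = tp + fn
    support_no_growth = tn + fp
    total = support_growth + support_no_growth
    return tuple(
        (g * support_growth + n * support_no_growth) / total
        for g, n in zip(growth, no_growth)
    )


# Toy counts for illustration only; these are not values from the figure.
precision, recall, f1 = weighted_scores(tp=40, tn=30, fp=10, fn=20)
```

Under this definition the weighted recall equals overall accuracy, which is a quick consistency check against the confusion matrices in panels D–F.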
Fig 6.
Pangenome-scale knock-out simulations across different media and identification of globally essential metabolic reactions.
(A) Single-reaction deletion analysis across 2,377 strains in four different media yielded 22.4 million knock-out phenotypes, revealing essential genes of E. coli. These essential genes are grouped by pangenome category, illustrating how the distinct genes in each category contribute to the number of essential phenotypes. (B) Average fitness scores of strain-specific metabolic networks across different media. Each dot represents a strain’s metabolic network, with the Y-axis indicating the count of essential reactions (fitness > 95%) per strain and the X-axis showing the average fitness score of each strain’s metabolic network. Annotations highlight the media in which fitness scores were predicted. (C) UpSet plot illustrating the number of media-dependent essential reactions and their overlaps across simulated media. Reactions that are essential across all media and consistent across all strains, referred to as globally essential reactions, are highlighted in red. (D) Genetic basis diversity of reactions versus essentiality: Each box represents an essentiality category (Essential, Conditionally Essential, and Non-Essential). Dots indicate individual reactions within each category. The Y-axis displays the genetic basis diversity. Other statistics are presented in the legend table. (E) Fitness scores of rare genes across media: The density plot illustrates the distribution of fitness scores for rare essential genes, color-coded by simulated media. The X-axis represents fitness scores (0 = non-essential, 100 = essential).
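The essentiality categories in panels C and D can be illustrated with a small Python sketch. It assumes a fitness score above 95 marks a reaction as essential in a given strain and medium (the cutoff in panel B), that "conditionally essential" means essential in some but not all strain-medium combinations, and that a reaction absent from a model scores 0. The function name, reaction IDs, and strain names are hypothetical.

```python
ESSENTIAL_CUTOFF = 95.0  # assumed cutoff, taken from "fitness > 95%" in panel B


def classify_reactions(fitness):
    """Label each reaction by essentiality across all strains and media.

    `fitness` maps (strain, medium) -> {reaction_id: fitness score in 0..100}.
    """
    reactions = set().union(*(scores.keys() for scores in fitness.values()))
    labels = {}
    for rxn in sorted(reactions):
        # Assumption: a reaction missing from a model is treated as score 0.
        calls = [scores.get(rxn, 0.0) > ESSENTIAL_CUTOFF for scores in fitness.values()]
        if all(calls):
            labels[rxn] = "globally essential"
        elif any(calls):
            labels[rxn] = "conditionally essential"
        else:
            labels[rxn] = "non-essential"
    return labels


# Toy input for illustration only: two hypothetical strains in two media.
toy = {
    ("strain_1", "urine"): {"RXN_A": 100.0, "RXN_B": 100.0, "RXN_C": 0.0},
    ("strain_1", "serum"): {"RXN_A": 100.0, "RXN_B": 10.0, "RXN_C": 0.0},
    ("strain_2", "urine"): {"RXN_A": 99.0, "RXN_B": 100.0, "RXN_C": 5.0},
    ("strain_2", "serum"): {"RXN_A": 100.0, "RXN_B": 0.0, "RXN_C": 0.0},
}
labels = classify_reactions(toy)
```

Here `RXN_A` is labeled globally essential, `RXN_B` conditionally essential (essential only in urine), and `RXN_C` non-essential.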
Fig 7.
Most Frequent Essential Reactions Encoded by Rare Genes.
(A) The scatter plot shows the number of distinct rare essential genes versus the count of strains in which they are present. The highlighted data points represent reactions encoded by widely distributed rare essential genes, specifically anthranilate synthase (ANS), 1-deoxy-D-xylulose 5-phosphate synthase (DXPS), and nicotinate-nucleotide diphosphorylase (NNDPR). (B) Top 10 pathways by count of rare-essential alleles. Bars show the absolute number of rare essential alleles mapped to each pathway (annotated at bar ends) and are ordered from highest to lowest. (C) The metabolic map illustrates the biosynthesis pathway of L-tryptophan. Anthranilate synthase (ANS), the reaction most frequently encoded by rare essential genes, is highlighted to illustrate its diverse genetic basis by mapping its panGPR against the pangenome. The categories “Core,” “Rare,” and “Accessory” indicate the number of strains encoding the ANS reaction by each gene category. The red shading in the metabolic pathway highlights reactions encoded by rare essential genes across strains. The intensity of the shading reflects the prevalence of these rare essential genes in encoding the respective reactions across strains. The blue shading highlights reactions that contain no rare essential genes, indicating that these reactions are non-essential.