Fig 1.
Overview of the approach employed in this study.
Fig 2.
A subset of Salmonella genes is strongly indicative of invasive potential.
A: Out-of-bag votes for the phenotype of each serovar, cast by each model. Model 1 was built using all predictor variables; each successive model was built by sparsity pruning of the previous model’s predictor variables. Model 5 is the final model, with 100% accuracy. Out-of-bag votes include only those votes cast by trees that were not trained on a given sample. The dashed grey line indicates the voting threshold for classifying an isolate as invasive. Invasive serovars are coloured in red and gastrointestinal serovars in blue. B: Of all genes used in the original training dataset, only a small minority are given high importance in identifying invasive strains. Variable importance is shown for the top 1000 genes used in the original training set, measured as the average decrease in Gini index in a random forest model trained on all orthologous groups that met the inclusion criteria (N = 6,438). C: Functional categories associated with the top predictive genes. D: Mutations in mrcB (penicillin-binding protein 1b), one of the top three predictors. Mutations are colour-coded by strain, with red bars indicating a mutation in an extraintestinal strain and blue bars indicating a mutation in a gastrointestinal strain. The y-axis shows an estimate of the effect of each mutation on protein function (DeltaBS), with positive values indicating a higher chance that the mutation impacts protein function. The x-axis represents the length of the protein.
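The out-of-bag voting described in panel A can be illustrated with a minimal sketch. This is not the study's code: the function names, the input layout (per-tree votes plus in-bag masks), and the default threshold of 0.5 are all assumptions made for illustration. The caption does not state the value of the voting threshold.

```python
from typing import List, Optional, Sequence


def oob_vote_fractions(
    tree_votes: Sequence[Sequence[int]],
    in_bag: Sequence[Sequence[bool]],
) -> List[Optional[float]]:
    """Fraction of out-of-bag trees voting 'invasive' for each sample.

    tree_votes[t][i] is the vote of tree t for sample i
    (1 = invasive, 0 = gastrointestinal).
    in_bag[t][i] is True when sample i was part of the bootstrap
    sample used to train tree t, so that tree's vote is excluded.
    """
    n_samples = len(tree_votes[0])
    fractions: List[Optional[float]] = []
    for i in range(n_samples):
        # Keep only votes from trees that never saw sample i in training.
        oob = [votes[i] for votes, bag in zip(tree_votes, in_bag) if not bag[i]]
        fractions.append(sum(oob) / len(oob) if oob else None)
    return fractions


def call_phenotype(fraction: Optional[float], threshold: float = 0.5) -> Optional[str]:
    """Apply a voting threshold (hypothetical default) to one vote fraction."""
    if fraction is None:
        return None  # sample was in-bag for every tree, so no OOB call is possible
    return "invasive" if fraction > threshold else "gastrointestinal"


if __name__ == "__main__":
    # Toy example: 4 trees, 3 samples (values are made up).
    votes = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 1, 1]]
    bags = [
        [True, False, False],
        [False, True, False],
        [False, False, True],
        [False, False, False],
    ]
    for frac in oob_vote_fractions(votes, bags):
        print(frac, call_phenotype(frac))
```

Because each sample is scored only by trees that did not train on it, the vote fraction acts as an internal estimate of held-out performance without needing a separate test set.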
Fig 3.
Voting of the model on African iNTS and global gastrointestinal isolates.
A: Maximum likelihood phylogeny of all S. Enteritidis isolates included in the study, annotated with invasiveness ranking and clade (note: Outlier refers to the distinct sister clade of the global epidemic strains identified by [48], while Other refers to strains that do not belong to a named clade). B: Invasiveness indices for African and non-African clades of Salmonella. Lower and upper boundaries of the boxplots correspond to the 25th and 75th quantiles. C: The proportion of isolates from each tested dataset carrying a hypothetically attenuated coding sequence (HAC, defined by a DeltaBS > 3 relative to the reference serovar). Genes are ordered by the amount of degradation observed in African clades. African strains are shown on the positive y-axis in darker grey; global strains are shown on the negative y-axis in lighter grey.
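The HAC proportions in panel C can be sketched as follows. Only the DeltaBS > 3 cutoff comes from the caption. The nested-dictionary input, the function names, and the reading of "amount of degradation" as the HAC proportion in African isolates are assumptions made for illustration.

```python
from typing import Dict, List

HAC_THRESHOLD = 3.0  # DeltaBS cutoff stated in the caption


def hac_proportions(
    delta_bs: Dict[str, Dict[str, float]],
    threshold: float = HAC_THRESHOLD,
) -> Dict[str, float]:
    """Proportion of isolates carrying a HAC for each gene.

    delta_bs[isolate][gene] is the DeltaBS of that gene in that isolate,
    relative to the reference serovar. A gene counts as a HAC in an
    isolate when its DeltaBS is strictly greater than the threshold.
    """
    counts: Dict[str, List[int]] = {}
    for gene_scores in delta_bs.values():
        for gene, score in gene_scores.items():
            hits_total = counts.setdefault(gene, [0, 0])
            hits_total[0] += int(score > threshold)
            hits_total[1] += 1
    return {gene: hits / total for gene, (hits, total) in counts.items()}


def order_by_degradation(proportions: Dict[str, float]) -> List[str]:
    """Order genes from most to least often attenuated (ties broken by name)."""
    return sorted(proportions, key=lambda gene: (-proportions[gene], gene))


if __name__ == "__main__":
    # Toy data with hypothetical isolate and gene names.
    african = {
        "iso1": {"geneA": 4.2, "geneB": 0.5},
        "iso2": {"geneA": 3.5, "geneB": 3.1},
    }
    props = hac_proportions(african)
    print(order_by_degradation(props), props)
```

Running the same function on the global isolates and plotting its values below the axis would reproduce the mirrored layout the caption describes.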
Fig 4.
Invasiveness indices and DeltaBS (DBS) values for isolates collected during long-term invasive infection of an immunocompromised patient provide evidence for parallel adaptation.
Black points show the increase in the invasiveness index over time. Boxplots show a significant shift in DBS distribution over the duration of carriage for genes selected by our model, which was built from well-characterised invasive serovars, as compared to the rest of the proteome. Isolates are from [10]. DBS distributions for 2001 have been pooled, but are representative of all three isolates individually. The y-axis for DBS values has been truncated for better visualisation.