Fig 1.
Analysis of sources of error in serovar predictions on a set of 4,291 Salmonella enterica draft genome sequences analyzed using the SISTR platform.
The contributions of the various error types to observed differences between “reported” and “predicted” serovars were tabulated. Of the 4,192 genomes retained for the analysis, 3,982 (94.99%) had correct serovar predictions.
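As a quick check of the arithmetic in this legend, the reported accuracy follows directly from the two counts. The function name below is a hypothetical helper and is not part of SISTR.

```python
def prediction_accuracy(n_correct: int, n_total: int) -> float:
    """Return the percentage of genomes with a correct serovar prediction."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return 100.0 * n_correct / n_total


# Counts taken from the Fig 1 legend: 3,982 correct out of 4,192 retained.
accuracy = prediction_accuracy(3982, 4192)
print(f"{accuracy:.2f}%")  # prints 94.99%
```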
Fig 2.
In silico serovar prediction identifies instances of Salmonella genomes with incorrect reported serovar information.
Within a large cgMLST330 cluster of genomes with reported serovar Agona, four genomes had a different reported serovar (Paratyphi B, n = 2; Anatum, n = 1; Derby, n = 1). On closer inspection, the predicted serovar for these genomes was Agona, consistent with the underlying cgMLST330 data.
Fig 3.
The SISTR serovar prediction logic can robustly yield accurate predictions for a range of genome qualities.
(A) The relative proportion of the various error types does not change appreciably as a function of the N50 assembly quality parameter. Type 4 errors, which are related to poor cgMLST330 metrics, are observed only among genomes with the lowest N50 values. (B) Although large numbers of missing cgMLST330 loci affect serovar prediction, as observed for Type 4 errors, accurate predictions were also made for genomes with as few as 296 complete cgMLST330 loci.
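For readers unfamiliar with the N50 parameter used in panel A, the sketch below uses the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the contig lengths are illustrative assumptions and do not come from the study data.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running sum to half the
        # assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Made-up contig lengths: total 290, so half is 145; 80 + 70 = 150 >= 145.
print(n50([80, 70, 50, 40, 30, 20]))  # prints 70
```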
Fig 4.
High accuracy was observed for the SISTR serovar prediction pipeline.
The prediction accuracy was assessed for serovars with 10 or more genome representatives, using only genomes with metadata of sufficient quality to enable extraction of serovar information and with high cgMLST data quality (n = 4,188). Accuracy was computed as the concordance between reported and predicted serovars. The “uncorrected” prediction accuracy, which is based on the original set of input genomes (n = 4,291), is shown in red. The “corrected” prediction accuracy, which is based on reclassification of genomes with Type 1 and 2 errors and removal of genomes with Type 3 and 4 errors, is shown in blue. Note: where distinct corrected and uncorrected concordance values cannot be distinguished, the two values are identical.
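The difference between the two accuracy values can be illustrated with a minimal sketch. It assumes that reclassified genomes (Type 1 and 2 errors) count as concordant after correction and that genomes with Type 3 and 4 errors are dropped from the denominator; the outcome labels, the extra "mismatch" category, and the toy counts are all hypothetical and are not taken from the study.

```python
from collections import Counter

# Hypothetical outcome labels for this sketch (not SISTR output values).
CONCORDANT = "concordant"
RECLASSIFIED = {"type1", "type2"}  # assumed to count as concordant once corrected
REMOVED = {"type3", "type4"}       # assumed to be excluded once corrected


def uncorrected_accuracy(outcomes):
    """Percent of all genomes whose reported and predicted serovars agree."""
    counts = Counter(outcomes)
    return 100.0 * counts[CONCORDANT] / len(outcomes)


def corrected_accuracy(outcomes):
    """Percent concordance after reclassifying Types 1-2 and removing Types 3-4."""
    counts = Counter(outcomes)
    removed = sum(counts[label] for label in REMOVED)
    reclassified = sum(counts[label] for label in RECLASSIFIED)
    kept = len(outcomes) - removed
    return 100.0 * (counts[CONCORDANT] + reclassified) / kept


# Toy data for one serovar: 100 genomes, made up for illustration only.
toy = (["concordant"] * 88 + ["type1"] * 4 + ["type2"] * 2
       + ["type3"] * 3 + ["type4"] * 1 + ["mismatch"] * 2)
print(uncorrected_accuracy(toy), corrected_accuracy(toy))
```

In this toy case the corrected value is higher because six discordant genomes are reclassified and four are removed, leaving only the two unexplained mismatches among 96 genomes.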
Fig 5.
Differences between reported and predicted serovars for Weltevreden genomes are due to an incorrect reported serovar.
While the large majority of genomes analyzed in the SISTR server were predicted as Weltevreden (n = 64), the remaining four genomes were predicted to have different serovars; these predictions matched the predominant serovar of their corresponding cgMLST cluster. The percent concordance between cgMLST cluster and serovar, and the cgMLST cluster size, are shown in parentheses.
Fig 6.
A high level of concordance was observed between cgMLST cluster and serovar.
The concordance was based on the proportion of genomes in a cgMLST cluster that belonged to the predominant predicted serovar in the group. The cgMLST clusters were defined at a similarity threshold of 85%; only clusters with four or more members are shown.
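The concordance measure described in this legend can be sketched as follows: for each cluster, take the most frequent predicted serovar and divide its count by the cluster size, skipping clusters with fewer than four members. The function name, the input layout (cluster identifier mapped to a list of predicted serovars), and the example clusters are assumptions for illustration; cluster assignment at the 85% similarity threshold is taken as already done.

```python
from collections import Counter


def cluster_concordance(clusters, min_size=4):
    """Map each cluster ID to (predominant serovar, fraction of members with it)."""
    result = {}
    for cluster_id, serovars in clusters.items():
        if len(serovars) < min_size:
            continue  # only clusters with four or more members are reported
        predominant, count = Counter(serovars).most_common(1)[0]
        result[cluster_id] = (predominant, count / len(serovars))
    return result


# Made-up clusters for illustration only.
example = {
    "c1": ["Agona"] * 9 + ["Derby"],
    "c2": ["Newport", "Newport", "Bardo", "Newport"],
    "c3": ["Anatum", "Anatum"],  # too small, skipped
}
for cluster_id, (serovar, fraction) in cluster_concordance(example).items():
    print(cluster_id, serovar, f"{100 * fraction:.0f}%")
```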
Fig 7.
Evidence from cgMLST supports the polyphyletic origin of Salmonella Newport.
Minimum spanning tree visualization of the cgMLST phylogeny for a set of Salmonella enterica genomes (n = 2,002), created in the SISTR server. The predicted serovars of the Newport and Agona genomes have been projected onto the tree to highlight the contrast between a polyphyletic serovar (Newport) and a monophyletic serovar (Agona).