Fig 1.
Overview of the workflow used.
For each species, we selected between 3 and 8 publicly available closed whole-genome sequences as references and 20 sets of short reads from whole-genome sequencing projects. Reads were mapped to each selected reference genome of the species, and consensus sequences were obtained from the high-quality SNPs of each mapping. Consensus sequences from the mappings to the same reference genome were added to the multiple sequence alignment (MSA) of all references of each species. For the analysis of each MSA, (a) we considered only those genome regions present in the reference used for mapping and (b) we obtained a ‘core’ MSA by removing all the regions absent from any of the reference sequences. Finally, we studied the impact of reference choice on the maximum-likelihood (ML) trees inferred from each MSA, on the recombination rates calculated from the ‘core’ MSAs and on the dN/dS ratios calculated from coding sequences only.
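As an illustration of step (b), the ‘core’ MSA can be thought of as the set of alignment columns in which every reference sequence has a base. The following is a minimal sketch under stated assumptions, not the pipeline used in the study: the alignment is held as a dictionary of equal-length strings, absent regions are encoded as `-`, and the function name `core_msa` and the sequence identifiers are hypothetical.

```python
from typing import Dict, List

# Assumption: regions absent from a sequence are encoded as '-' in the MSA.
GAP_CHARS = {"-"}


def core_msa(msa: Dict[str, str], reference_ids: List[str]) -> Dict[str, str]:
    """Keep only the columns where no reference sequence has a gap."""
    lengths = {len(seq) for seq in msa.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences in an MSA must have the same length")
    n_columns = lengths.pop()

    # A column belongs to the core if it is present in every reference.
    keep = [
        i
        for i in range(n_columns)
        if all(msa[ref][i] not in GAP_CHARS for ref in reference_ids)
    ]
    # Apply the same column filter to references and consensus sequences.
    return {name: "".join(seq[i] for i in keep) for name, seq in msa.items()}


# Hypothetical toy alignment: two references and one consensus sequence.
toy_msa = {"refA": "ACG-T", "refB": "A-GTT", "iso1": "ACGNT"}
toy_core = core_msa(toy_msa, ["refA", "refB"])
```

In this toy alignment, columns 2 and 4 are dropped because one of the two references has a gap there, so every sequence is reduced to three columns.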
Fig 2.
Distribution of proportion of mapped reads depending on reference choice.
Fig 3.
Distribution of coverage of the reference genome depending on reference choice.
Table 1.
Proportion of significant (P<0.05) comparisons depending on reference choice.
Fig 4.
Distribution of the average depth depending on reference choice.
Fig 5.
Distribution of the number of SNPs depending on reference choice.
Fig 6.
Comparison of Robinson-Foulds (RF) and matching clusters (MC) normalized distances calculated between trees from the same species.
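For readers unfamiliar with the RF metric, the sketch below shows one common way to compute a normalized RF distance between two unrooted trees with the same leaves. It rests on assumptions that are not taken from the study: trees are written as nested tuples of leaf names, and the distance is normalized by 2(n − 3), the maximum for fully resolved unrooted trees with n leaves. The normalization actually used for this figure may differ, and the MC distance is not covered here.

```python
def leaf_set(tree) -> frozenset:
    """Return the leaf names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree) -> set:
    """Collect the non-trivial bipartitions induced by internal branches."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # fixes which side of a split represents it
    found = set()

    def visit(node) -> frozenset:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - clade
        if len(clade) > 1 and len(other) > 1:
            found.add(clade if anchor in clade else other)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b) -> float:
    """RF distance divided by 2(n - 3); assumes identical leaf sets."""
    leaves = leaf_set(tree_a)
    if leaves != leaf_set(tree_b):
        raise ValueError("trees must contain the same leaves")
    rf = len(nontrivial_splits(tree_a) ^ nontrivial_splits(tree_b))
    max_rf = 2 * (len(leaves) - 3)
    return rf / max_rf if max_rf > 0 else 0.0


# Hypothetical five-leaf trees that differ in the placement of B and C.
tree_1 = ((("A", "B"), "C"), ("D", "E"))
tree_2 = ((("A", "C"), "B"), ("D", "E"))
distance = normalized_rf(tree_1, tree_2)
```

The two toy trees share one of their two internal splits, so half of the maximum possible distance is reached.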
Table 2.
Descriptive statistics of topological distances per species.
Fig 7.
Comparison of RF distances against ANI calculated between the reference genomes selected for each species.
Table 3.
Congruent comparisons according to ELW test.
All the other pairwise comparisons were not congruent (P<0.05).
Fig 8.
Impact of reference choice on phylogenetic trees of L. pneumophila.
ML trees included the selected reference sequences of L. pneumophila and the consensus sequences obtained from mappings against strains (A) Philadelphia 1, (B) Paris, (C) Alcoy and (D) Lansing 3. Clusters of isolates related to references Paris (red) and Alcoy (blue) are coloured in the first three phylogenies. Isolates 28HGV and 91HGV (highlighted in yellow) were placed in different clades when using references Paris and Alcoy. The clade of references that results from using Lansing 3 as the reference genome is coloured in red.
Fig 9.
Impact of reference choice on phylogenetic trees of K. pneumoniae.
ML trees included the selected reference sequences from K. pneumoniae and the consensus sequences obtained from mappings against strains (A) HS11286 and (B) NTUH-K2044. Isolates HGV2C-06 and HCV1-10 (yellow) changed their placement depending on reference choice.
Fig 10.
Impact of reference choice on phylogenetic trees of P. aeruginosa.
ML trees included the selected reference sequences of P. aeruginosa and the consensus sequences obtained from mappings against strains (A) M18 and (B) 12939. Reference M18 and isolate P5M1 (yellow) alter their phylogenetic relationships depending on reference choice.
Fig 11.
Impact of reference choice on phylogenetic trees of S. marcescens.
ML trees included the selected reference sequences from S. marcescens and the consensus sequences calculated from alignments against strains (A) UMH9 and (B) WW4. The outbreak clade is shown in red.
Fig 12.
Distribution of recombination rates depending on reference choice, compared between ‘core’ MSAs including sequences from N. gonorrhoeae.
Fig 13.
Distribution of dN/dS depending on reference choice.