Impact of the choice of reference genome on the ability of the core genome SNV methodology to distinguish strains of Salmonella enterica serovar Heidelberg

Salmonella enterica serovar Heidelberg (S. Heidelberg) is one of the top serovars causing human salmonellosis. The core genome single nucleotide variant pipeline (cgSNV) is one of several whole genome based sequence typing methods used for the laboratory investigation of foodborne pathogens. SNV detection using this method requires a reference genome. The purpose of this study was to investigate the impact of the choice of the reference genome on the cgSNV-informed phylogenetic clustering and inferred isolate relationships. We found that using a draft or closed genome of S. Heidelberg as reference did not impact the ability of the cgSNV methodology to differentiate among 145 S. Heidelberg isolates involved in foodborne outbreaks. We also found that using a distantly related genome such as S. Dublin as choice of reference led to a loss in resolution since some sporadic isolates were found to cluster together with outbreak isolates. In addition, the genetic distances between outbreak isolates as well as between outbreak and sporadic isolates were overall reduced when S. Dublin was used as the reference genome as opposed to S. Heidelberg.


Introduction
Nontyphoidal Salmonella (NTS) enterica serovars are the most important causes of bacterial gastroenteritis. Among the NTS serovars, Heidelberg is ranked as the second and third most frequent serovar recovered from clinical cases in Québec and Canada respectively [1]. In Québec between 2004 and 2014, 23% of S. Heidelberg clinical isolates were from blood specimens, compared to 7% for S. enterica serovar Enteritidis and 5% for S. enterica serovar Typhimurium, suggesting an increased capacity of this serovar to cause invasive systemic disease [2]. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Pulsefield gel electrophoresis (PFGE) has been the gold standard method used by PulseNet Canada (PNC) since the 1990s for the molecular typing of Salmonella during outbreak investigations. However, a major drawback with the use of PFGE in outbreak investigation is the low resolution power of this technique that is further exacerbated when applied to S. Heidelberg typing owing to the extremely low genetic diversity of this serovar. For example, 70% of S. Heidelberg isolated in Québec belonged to pulsovar 2 [2].
Whole genome sequence (WGS) based methods owing to their growing availability and high genomic resolution are rapidly replacing traditional typing methods such as PFGE within major public health laboratories including PNC [3]. Two popular methods that are increasingly applied in the field of bacterial genomic epidemiology are: the gene-by-gene methods which is basically an extension of the 7 gene MLST typing technique to encompass the entire genome (whole genome MLST, wgMLST) or just the core genome (core genome MLST, cgMLST) [4,5] and the single nucleotide variant (SNV) methods which identifies single nucleotide variants by comparing a population of target genomes against a reference [6,7]. We recently found that the cgSNV method provided superior discriminatory power than traditional methods during outbreak investigations involving Salmonella Heidelberg [2].
The choice of the reference genome has been previously proposed as a potential consideration affecting core genome SNV (cgSNV)-based analysis and outcomes. For example, choosing a distantly related strain as a source of reference may tend to cluster isolates that are otherwise genetically distant. Another concern with the choice of reference genome is the sequencing status of the reference genome. It is generally perceived that high-quality complete genomes are preferred to ensure accurate and epidemiologically concordant phylogenetic analysis and outbreak investigation. These concerns were not addressed in our previous work on the cgSNV method [2]. Here using draft de novo assembled and completely sequenced or closed genomes of S. Heidelberg as well as a distantly related genome such as S. Dublin as references, we assessed the ability of the cgSNV methodology to differentiate amongst 145 S. Heidelberg strains involved in four distinct outbreaks and sporadic cases of salmonellosis in Québec.

Collection and characterization of bacterial isolates
The 145 S. Heidelberg clinical isolates described in this study were collected as part of the Quebéc surveillance program on human salmonellosis established since 2003 to ensure rapid detection of outbreaks. The food isolates were collected by the Ministère de l'Agriculture, des Pêcheries et de l'Alimentation du Québec (MAPAQ) during routine food-poisoning investigations. Isolates were grown on triple sugar iron agar at 37˚C and stored at -80˚C in trypticase soy spiked with 10% glycerol. PFGE and serotyping was performed at the Laboratoire de Santé Publique du Québec (LSPQ) following PNC guidelines.

Whole genome sequencing
Frozen bacterial isolates were cultured overnight at 37˚C in brain heart infusion broth and genomic DNA was extracted using the Metagenomic DNA isolation Kit for Water (Epicentre, Madison, WI). Samples were prepared using Nextera XT chemistry (Illumina, Inc., San Diego, CA) and were sequenced using Illumina Miseq paired-end read technology using 300 base read lengths. Five strains were selected from the outbreak isolates to serve as references and their reads were de novo assembled using SPAdes v. 3.9 [8]. The complete genome counterparts for these strains have been reported in a previous work [9]. Two draft and three complete, unrelated reference genomes were also downloaded from NCBI and included in this analysis making a total of 15 assessed reference genomes (Table 1).

CgSNV typing
CgSNV analysis was performed using the SNVPhyl pipeline [10] v.1.0 integrated within the NML instance of the Galaxy platform [11]. Briefly, paired-end sequence reads from the 145 isolates were aligned against each of the 15 reference genomes using SMALT v.0.7.5 (http:// www.sanger.ac.uk/science/tools/smalt-0). MUMmer v.3.23 [12] and PHAST [13] were used to identify repeat and prophage regions in each reference genome respectively and these regions were excluded from the analysis. Variants were called using two independent variant calling algorithms: FreeBayes v.0.9.20 and SAMtools [14] /BCFtools based on predefined criteria described elsewhere [10]. To infer the relationship between these isolates, minimum spanning trees were constructed from the SNVphyl output data using the geoBURST algorithm built into PHYLOViZ v2.0 [15].

Topological similarity
We assessed the topological similarity of the phylogenetic trees using the Robinson and Foulds (RF) test [16]. This test is a widely used tree metric for tree-to-tree distances and is defined as the minimum number of operations needed to transform one tree into the other. Briefly, newick tree files from the SNVphyl pipeline generated using each of the 14 genomes as references were concatenated and the resulting file was submitted to the online phylogenetic tool T-REX [17] to compute the topological distances between the trees.

Nucleotide sequence accession numbers
The sequence data supporting the results of this article have been deposited in the NCBI Sequence Read Archive under accession number SRP098783.

Heidelberg isolates
Epidemiologic and PFGE fingerprinting results of the 145 S. Heidelberg isolates used in this study are presented in Table 2.
Isolates from four distinct outbreaks that occurred in Quebéc between 2012 and 2016 were also included in the analysis. These outbreaks were designated as follows: outbreak 1, 2012 (n = 10; 8 human and 2 food isolates) outbreak 2, 2013 (n = 8 human isolates) outbreak 3, 2014 (n = 28; 12 human and 16 food isolates and outbreak 4, 2016 (n = 8 human isolates). All human cases and food items linked to these outbreaks were confirmed by epi-data. Outbreak 1 was linked to a wedding party, outbreak 2 and 3 were traced to separate restaurants and outbreak 4 was associated with a daycare catering service. In addition to the outbreak isolates (n = 54) we also added 91 sporadic clinical isolates collected in Quebéc between 2007 and 2016 into the analysis.

Whole genome sequencing results
An average of 983,919 reads was obtained per isolate (range, 339,270-2,974,917) for the set of 145 S. Heidelberg isolates, corresponding to an estimated average genome coverage of 121x (range, 42x -365x). The number of SPAdes-assembled contigs for the five outbreak isolates that were selected to act as the reference ranged from 24-27 with all the isolates assembled into fewer than 27 contigs ( Table 3). The completely sequenced genome equivalent of these isolates have been published in a previous study [9].

Core genome single nucleotide analysis
After removing repeats and prophages as well as SNV-dense regions from all the reference genomes, an average of 4,008,254 genomic positions (range 3,681,444-4,049,343) representing an average of 86% of the reference genomes (range 84.63-86.72%) had sufficient coverage (!15x) across all 145 isolates for reference mapping. For all the S. Heidelberg reference genomes, an average of 769 high-quality consensus SNVs (range 751-819) were identified by both variant callers as common to all isolates and used for subsequent phylogenetic clustering. For the distantly related S. Dublin genome, 18,155 high-quality core genome SNVs were used to construct the phylogeny. In total, 15 minimum spanning trees were generated with 14 of these trees representing the 14 S. Heidelberg reference genomes. All outbreak isolates formed distinct clusters with all the 14 S. Heidelberg reference genomes and the topologies of these trees were highly similar (Fig 1A and 1B).
Using S. Dublin as the reference genome led to a loss in resolving power. In fact, sporadic isolates clustered with outbreak 1 isolates (Fig 2).
The phylogenetic features revealed by the minimum spanning trees were nearly identical across all the 14 S. Heidelberg reference genomes ( Table 4).
The genetic distances observed between outbreak isolates as well as between outbreak and sporadic isolates were nearly identical across the 14 S. Heidelberg reference genomes. Using S. Dublin as the reference genome led to a significant reduction in genetic distances between sporadic and outbreak isolates with some sporadic isolates having 0 and 3 SNV difference with outbreak 1 and 3 respectively whereas the genetic distances between isolates in these outbreaks ranged from 0-3 SNVs (Table 5).

Topological similarity of the phylogenetic trees
To confirm the similarities in tree topologies, we performed the RF test. The computed RF topological distances between the 14 trees ranged from 0 to 24 (Table 6).

Discussion
In this study we assessed the impact of the choice of the reference genome on the resolution clustering of S. Heidelberg outbreak and sporadic isolates using the cgSNV methodology. Our results revealed that using a draft or completely sequenced S. Heidelberg genomes as references did not affect the ability of the cgSNV method to distinguish between four epidemiologically well characterized S. Heidelberg outbreak isolates and to separate these isolates from sporadic or background strains. In fact, all outbreak and sporadic isolates were clustered on distinct branches (Fig 1A and 1B). On the contrary, using S. Dublin as reference choice resulted in a tree with less resolution (Fig 2). Sporadic isolates clustered together with outbreak 1 isolates and in addition, the genetic distances observed within outbreak as well as between outbreak and sporadic isolates was overall reduced when S. Dublin was used as reference choice as opposed to S. Heidelberg reference genomes (Figs 1 and 2, Table 5). This finding is in agreement with a recent study on Salmonella which found that using a distantly related genome as a choice of reference failed to cluster S. Enteritidis outbreak strains concordantly [18]. These observations emphasize the need to choose an appropriate reference genome during laboratory investigations of foodborne outbreaks involving reference based methods. The loss in resolution observed in this study can be due to the following reasons: Firstly, the SNVPhyl pipeline uses only core genome SNVs to build the phylogeny implying that a reference genome with high similarity with the isolates under investigation would have a larger core genome from which more SNVs can be produced to build a phylogeny as opposed to a dissimilar reference such as S. Dublin. However, the smallest core genome among all the reference genomes used in this study was equivalent to 84.76% of the total genome. Secondly, another issue to consider is that as the reference genome grows more distant from the sequences under analysis, more variation would be observed between the core regions of the reference genome and all other sequences leading to a long branch separating the reference Impact of reference genome choice on the core genome SNV methodology genome from the rest of the other isolates. This was indeed what we observed using S. Dublin as reference (Fig 2) since the majority of the 18155 positions used to construct the phylogeny were indeed variations between the reference (17666 positions) and the other isolates.
The RF values for the trees constructed using S. Heidelberg draft genome and its corresponding closed genome equivalent was zero with the exception of the reference genome pair ID134609 (tree 7 vs. tree 8) whereby the RF topological distance was 24. This difference could be attributed to misidentification of SNVs linked to the presence of repetitive regions in the draft genome that were not properly detected. By nature, draft genomes are not entirely genomically accurate and for this reason, it is possible that the SPAdes assembly for this isolate may have collapsed repetitive regions larger than the read/read pair size into a single contig. Aligning reads to repetitive regions is problematic and has been reported to lead to potential misidentification of SNVs [19]. Although we used the software MUMmer to identify and filter the repetitive regions of the reference genome, for a closed genome, this method will be much more successful due to its intact, completely mapped sequence than for a draft genome, which may erroneously possess several copies of unmapped repeat regions. These unmapped repeats can lead to repeat reads mapping to a single contig resulting in the inclusion of additional SNVs and subsequent alignment issues and false SNV calls. Interestingly in this reference pair, 766 sites were used to generate the phylogeny using the draft genome as reference as opposed to 761 sites for the closed genome counterpart. Despite the slight differences in topology between this reference pair as well as in other tree pairs, both the draft and closed genomes Impact of reference genome choice on the core genome SNV methodology  Impact of reference genome choice on the core genome SNV methodology   Impact of reference genome choice on the core genome SNV methodology were still able to differentiate the outbreak from sporadic isolates in concordance with epidemiological data. In agreement with our observations, a recent study on Listeria monocytogenes also demonstrated that phylogenetic clustering based on SNV analysis using a de novo assembled draft genome selected from within the group was similar to the phylogeny using a closely related closed genome [6]. Despite the increasing accessibility of WGS-based technologies and decreasing costs, the implementation of WGS for routine surveillance and outbreak investigation may still present many challenges as sequencing a genome to completion remains a costly and time-consuming endeavour and for many public-health laboratories, this is not a viable option. [20]. The results of our study indicates that draft genomes can be relied upon as a suitable reference choice during laboratory investigations of foodborne outbreaks using the cgSNVphyl pipeline. In fact, a recent study evaluated the quality score of 32000 genomes located in public repositories and concluded that most of these genomes were of sufficient quality to perform analysis on and only 10% of draft genomes were of poor quality and unsuitable for downstream analysis [21]. In conclusion, our results provide strong evidence that the choice of the reference genome does not impact the ability of the cgSNV methodology to distinguish between S. Heidelberg isolates involved in foodborne outbreaks. Our results also demonstrate that using a distantly related genome as reference could lead to a loss in resolution during cgSNV analysis. Although wgMLST was recently recommended as the primary subtyping tool moving forward by PNC and other foodborne surveillance networks [22], it is important to note that the cgSNV approach still remains an important method in the PNC molecular tool box. In fact, this method was recently used by PNC in collaboration with provincial public health laboratories to identify the source of a multi-provincial outbreak of S. Enteritidis [23]. Despite the advantages provided by cg/wgMLST approaches such as curability and standardization, the development and validation of schemas for each organism still remains a daunting task both financially and technically for public health laboratories operating with limited resources. The implementation of the cgSNV methodology described here could be a Table 6. Robinson-Foulds topological distances between trees generated with 145 S. Heidelberg sequenced isolates using draft and closed genomes as references during cgSNV analysis. Tree  3   Tree  4   Tree  5   Tree  6   Tree  7   Tree  8   Tree  9   Tree  10   Tree  11   Tree  12   Tree  13   Tree  14   Tree 1 0  20  20  20  20  20  20  20  20  7  11  15  11   Tree 2  20  20  20  20  20  20  20  10  7  11  15  11   Tree 3  0  0  0  24  0  24  24  13  11  15  11   Tree 4  0  0  24  0  24  24  13  11  15  11   Tree 5  0  24  0  24  24  13  11  15  11   Tree 6  24  0  24  24  13  11  15  Impact of reference genome choice on the core genome SNV methodology viable alternative for monitoring S. Heidelberg. Whether this method applies to other Salmonella serovars remains to be determined.