Recombination and Population Structure in Salmonella enterica

Salmonella enterica is a bacterial pathogen that causes enteric fever and gastroenteritis in humans and animals. Although its population structure was long described as clonal, based on high linkage disequilibrium between loci typed by enzyme electrophoresis, recent examination of gene sequences has revealed that recombination plays an important evolutionary role. We sequenced around 10% of the core genome of 114 isolates of enterica using a resequencing microarray. Application of two different analysis methods (Structure and ClonalFrame) to our genomic data allowed us to define five clear lineages within S. enterica subspecies enterica, one of which is five times older than the other four and two thirds of the age of the whole subspecies. We show that some of these lineages display more evidence of recombination than others. We also demonstrate that some level of sexual isolation exists between the lineages, so that recombination has occurred predominantly between members of the same lineage. This pattern of recombination is compatible with expectations from the previously described ecological structuring of the enterica population as well as mechanistic barriers to recombination observed in laboratory experiments. In spite of their relatively low level of genetic differentiation, these lineages might therefore represent incipient species.


Introduction
Salmonella enterica subspecies enterica (subsequently referred to simply as enterica) is a major cause of enteric fever in humans and gastroenteritis in humans and animals. Its diversity has traditionally been described on the basis of serological differences following the Kauffmann-White classification [1,2]. Certain serovars are linked to particular diseases and hosts. For example, enteric fever is mostly caused by members of serovar Typhi and Paratyphi A, both of which only infect humans [3]. Gastroenteritis on the other hand is most often caused by Enteritidis in humans and Typhimurium in animals [4], although both serovars can infect a wide range of hosts [3]. However, the usefulness of the serological classification of S. enterica is undermined by the fact that unrelated strains sometimes belong to the same serovar [5,6].
In an attempt to shed some new light on the population structure of enterica, a multi-locus sequence typing scheme (MLST; [7,8]) was developed which relies on the sequencing of 400-500 bp fragments from seven housekeeping genes. This typing technique was originally applied to strains from serovar Typhi [9], and later to the whole of enterica [10,11]. Phylogenies reconstructed from MLST data are highly star-shaped [12] and therefore carry little information about relationships between isolates. This can be traced back to substantial incongruencies between gene trees [13,12,14], which are often caused by high levels of homologous recombination [15]. This is in contrast for example with the closely related species Escherichia coli which has a well defined population structure made of several clearly defined clades [16].
The first genomes of enterica to be fully sequenced were those of Typhimurium LT2 [17] and Typhi CT18 [18], followed by those of Typhi Ty2 [19], Paratyphi A [20] and Choleraesuis [21]. A comparison of the genomes of Typhi and Paratyphi A revealed that they had exchanged about a quarter of their genes during the course of their adaptation to a human-specific and highly virulent lifestyle [22]. This high level of recombination is, however, exceptional between two distantly related lineages of enterica [22], and selection is likely to have favoured recombinants between these two types which combined adaptations to their new host [22]. The pattern of recombination of these strains, with a burst of recombination being followed by completely clonal evolution [23,24], appeared to be atypical of gene flow in the species as a whole, but only limited data from a small number of lineages has been analyzed [22]. The number of enterica genomes currently available is insufficient (only eleven whole published genomes available at the time of writing in the Genomes OnLine Database; [25]), and their distribution is too focused on highly virulent types to allow an exploration of the population genetics of enterica. Furthermore, statistical methodology to analyze such whole-genome data efficiently is currently lacking [26,15].
Reconstructing the clonal relationships between lineages that have evolved under the influence of recombination requires data from a large number of loci [27]. We therefore designed an Affymetrix CustomSeq Resequencing Array to sequence approximately 300 kbp from the core genome of enterica isolates, which represents two orders of magnitude more data per isolate than is provided by MLST. Resequencing arrays are a highly parallel DNA sequencing technology with quick application and low cost, and are based on the principle of sequencing by hybridization [28]. They have been previously applied to a wide diversity of bacterial samples, including monomorphic clones such as Bacillus anthracis [29] or Mycobacterium tuberculosis [30], relatively clonal species such as Bacillus cereus [31] or Staphylococcus aureus [32], and species with high rates of recombination such as Neisseria meningitidis [33] or Francisella tularensis [34].
We applied our resequencing array to a global collection of 114 isolates from multiple major lineages of enterica, with the exception of Typhi. Typhi was excluded because extensive studies using a wide range of molecular techniques [23,35,24,36,37] have revealed that its population biology differs from that of other lineages of enterica. We therefore excluded Typhi from the present study in order to focus on the remainder of enterica, which has been studied much less thoroughly. The main aims of this study were to provide an improved description of the population structure of enterica and to clarify the role played by recombination during its evolution. To this end, we analyzed our genetic data using the linkage model of Structure [38,39] and ClonalFrame [40] with a posteriori attribution of the origin of recombination events [41].

Novel nucleotide sequences
For each of the 114 isolates under study (Table S1) we resequenced 146 regions of length 2000-2500 bp each from the core genome of enterica (Table S2). These 295,137 bp per isolate represent approximately 10% of the core genome of enterica [42]. Figure 1 illustrates the extent of our resequencing scheme on the genome of Typhimurium LT2 [17]. On average, 85% of nucleotides were called, with variation across isolates ranging from 75% to 95%. A total of 18,068 of the resequenced sites (6%) were found to be polymorphic in this sample. Regions overlapping the seven MLST loci were included in our resequencing scheme, and by comparing our results with preexisting MLST sequences we estimated the error rate of our method to be lower than one error per 10,000 calls. Only one isolate had more than one error in its MLST gene fragments: isolate 54 (SARB32; ST82) had two errors, one in gene hisD and the other in gene purE. An equivalent error rate was found when comparing the sequence of LT2 reported in [17] with our resequenced sequence of LT2. The density of errors was therefore sufficiently low that any error would be misinterpreted as a mutation rather than as a recombination event, and would not affect the results below, which focus essentially on the recombination process.
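The summary statistics quoted above (proportion of called nucleotides, number of polymorphic sites) can be recomputed from any alignment of the resequenced regions. The following is a minimal sketch, assuming that sequences are already aligned and that uncalled positions are coded as 'N'; the function name and the toy sequences are illustrative and are not taken from the released data.

```python
def summarize_calls(sequences, missing="N"):
    """Return (call_rate, n_polymorphic) for equal-length aligned sequences.

    A site counts as polymorphic when at least two different called
    bases are observed; uncalled positions are ignored.
    """
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    called = sum(base != missing for seq in sequences for base in seq)
    call_rate = called / (length * len(sequences))

    n_polymorphic = 0
    for column in zip(*sequences):
        alleles = {base for base in column if base != missing}
        if len(alleles) > 1:
            n_polymorphic += 1
    return call_rate, n_polymorphic


# Toy alignment (hypothetical): three isolates, five sites.
toy = ["ACGTN", "ACGAA", "ANGTA"]
call_rate, n_poly = summarize_calls(toy)

# Check of the proportion quoted in the text: 18,068 of 295,137 sites.
reported_fraction = 18_068 / 295_137  # about 0.061, i.e. roughly 6%
```

Ignoring uncalled positions when counting alleles is a design choice of this sketch; a stricter analysis might require a minimum number of called isolates per site.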

Population structure of Salmonella enterica
We applied the linkage model of Structure [38,39] to our data and identified K = 6 ancestral populations in our sample (Figure S1). The proportion of ancestry from each of these sources is shown for each isolate in Figure 2. The 114 isolates fell into six distinct groups based on the major ancestral source of genetic diversity of each isolate (Figure 2). Group 1 (light blue) consisted of 14 strains of Choleraesuis, Paratyphi C and Typhisuis, Group 2 (dark blue) comprised 12 strains of Typhimurium and Saint-Paul, Group 3 (orange) contained 17 strains of Montevideo, Javiana, Decatur and others, Group 4 (yellow) consisted of 19 strains of Enteritidis, Gallinarum and Dublin and Group 5 (red) comprised 5 strains of Paratyphi A and Sendai. Finally, Group 6 (cyan) contained the remaining 47 strains from diverse serovars. These groups showed relatively little admixture between ancestral sources (Figure 2), with the exception of Group 6, which seemed to have acted frequently both as a donor and as a recipient of recombinational exchanges (Figure 2).
ClonalFrame is a method designed to reconstruct the clonal relationships between isolates in a sample, while accounting for the effect of non-vertical genetic transfer which would otherwise confuse such a reconstruction [40]. Figure 3 shows the clonal genealogy inferred from our data by ClonalFrame. The first five groups identified by Structure (Figure 2) corresponded to clades in Figure 3 and are represented with corresponding colors. Based on the combined evidence from the Structure and ClonalFrame analyses, these five groups can confidently be called lineages of enterica. On the other hand, the sixth group found by Structure encompassed the remaining isolates in Figure 3, which did not constitute a clade in Figure 3 and therefore did not represent a true lineage. Instead, seven small groups of two to four isolates formed small clades at this level of analysis according to ClonalFrame, but these were not detected by Structure. The content of the five identified lineages of enterica is summarized in Table 1.
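The grouping rule used here, in which each isolate is assigned to the ancestral source that contributes most of its ancestry, can be sketched as follows. This is an illustration only: the isolate names and ancestry proportions are hypothetical, and the function names are not part of Structure.

```python
def major_ancestry_groups(ancestry):
    """Map each isolate to the 1-based index of its largest ancestry component.

    `ancestry` maps an isolate name to a list of K ancestry proportions
    (one per ancestral population, summing to 1).
    """
    groups = {}
    for isolate, proportions in ancestry.items():
        best = max(range(len(proportions)), key=lambda k: proportions[k])
        groups[isolate] = best + 1
    return groups


def admixture(proportions):
    """Proportion of ancestry that does not come from the major source."""
    return 1.0 - max(proportions)


# Hypothetical ancestry proportions for K = 6 populations.
ancestry = {
    "isolate_A": [0.97, 0.01, 0.01, 0.00, 0.00, 0.01],
    "isolate_B": [0.05, 0.10, 0.02, 0.03, 0.00, 0.80],
}
groups = major_ancestry_groups(ancestry)
```

With these toy values, `isolate_A` falls in group 1 with little admixture, whereas `isolate_B` falls in group 6 with about 20% of its ancestry drawn from other sources.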
Using Structure and ClonalFrame on MLST data only revealed parts of this population structure, and hardly revealed any relationships within lineages in comparison with the resequencing array data (Figures S3 and S4). Yet the deep phylogeny of enterica remained largely unresolved when using our resequencing data, and in particular the relationships of the five lineages above with one another and with the rest of the isolates remained unclear (Figure 3). We estimated the age of the five lineages relative to the time of the most recent common ancestor of the whole of enterica (Table 1). The common ancestor of lineage 5 was the most recent, followed by that of lineage 1. Lineage 3 was found to be particularly ancient, with an estimated age of two thirds of the age of enterica.

Author Summary
Salmonella enterica is a species of bacteria that causes severe diseases in humans and animals. We sequenced about a tenth of the genome from a broadly sampled collection of S. enterica. By comparing these genetic sequences, we were able to partially reconstruct the ancestry of this sample. We identified five lineages within S. enterica, one of which is almost as old as the common ancestor of our sample. We also found evidence for frequent homologous recombination in the ancestry of S. enterica, where fragments of genes from one individual bacterium are acquired by a distinct individual. These recombination events make the ancestry harder to reconstruct in its entirety, but also contain interesting information. We found in particular that recombination had happened more often between strains belonging to the same lineage than across lineage boundaries. This observation is compatible with the lineages of S. enterica becoming progressively isolated from each other, which could lead to their gradual splintering into new species.

Uneven role of recombination in enterica
Widespread recombination has previously been suggested to explain the lack of deep structure in enterica [12,14] and we wanted to assess the role played by recombination in the evolution of enterica. Measuring the frequency of recombination is often done relative to that of mutation [43] by forming the ratio $\rho/\theta$ of rates at which recombination and mutation occurred in the ancestry of a sample. ClonalFrame estimated that recombination happened less frequently than mutation with $\rho/\theta = 0.37$ (95% credibility interval [0.33, 0.41]). Recombination can however change several nucleotides in a single event. Another measure of recombination is therefore the ratio $r/m$ of rates at which substitutions are introduced by recombination and mutation [44]. ClonalFrame estimated that recombination and mutation had approximately the same effect in introducing polymorphism with $r/m = 1.14$ (95%CI [1.06, 1.23]). Recombination was found to affect segments of length 1826 bp on average (95%CI [1670, 1980]), which is comparable to the lengths of recombination tracts estimated when comparing four genomes of Typhimurium [40] as well as the lengths of the regions that were exchanged by Typhi and Paratyphi A [22].
We further studied recombination by looking at its specific role and patterns within each of the five lineages of enterica. The role played by recombination seems to be uneven across these five lineages according to the Structure results in Figure 2. The isolates in recently diversified populations 1 and 5 showed no admixture (<1% of material from other populations) whereas the isolates in populations 4, 3 and 2 had acquired 4%, 11% and 12% respectively of their genetic material from a different population (Figure 2). To confirm this observation, we extracted from ClonalFrame output the numbers of mutation events, recombination events, and substitutions introduced by recombination for each of the five lineages (Table 1). Recombination was found to have played a much more important role relative to mutation in lineages 2 and 3 ($r/m$ = 2.17 and 2.95 respectively) than in lineages 1 and 5 ($r/m$ = 0.20 and 0.15 respectively), and a somewhat intermediate role in lineage 4 ($r/m$ = 0.82). These results are in good qualitative agreement with those of Structure (Figure 2). Since lineages 1 and 5 are the most recently evolved from a common ancestor, these results point to a possible reduction in the role played by recombination in these two lineages, and maybe even throughout enterica.
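The two ratios discussed in this section can be obtained directly from the three event counts extracted per lineage. The sketch below assumes that the ratio of recombination to mutation rates is estimated as the ratio of event counts, and that r/m is the number of substitutions introduced by recombination divided by the number of mutation events. The counts in the example are hypothetical and were chosen only to reproduce the whole-sample ratios reported above; they are not the values in Table 1.

```python
def recombination_ratios(n_mutations, n_recombinations, n_recomb_substitutions):
    """Return (rho_over_theta, r_over_m) from reconstructed event counts.

    rho_over_theta: recombination events per mutation event.
    r_over_m: substitutions introduced by recombination per substitution
              introduced by mutation (one substitution per mutation event).
    """
    if n_mutations <= 0:
        raise ValueError("at least one mutation event is required")
    rho_over_theta = n_recombinations / n_mutations
    r_over_m = n_recomb_substitutions / n_mutations
    return rho_over_theta, r_over_m


# Hypothetical counts, for illustration only.
rho_theta, r_m = recombination_ratios(
    n_mutations=1000,
    n_recombinations=370,
    n_recomb_substitutions=1140,
)
```

The two measures can disagree in direction: a lineage with few recombination events may still show r/m above 1 if each event introduces several substitutions.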

Patterns of genetic flux in enterica
ClonalFrame estimated that within the regions imported by recombination, an average of $\nu = 0.32\%$ of the nucleotides were substituted (95%CI [0.31%, 0.33%]). This value of $\nu$ is significantly lower than the average pairwise distance between two members of enterica which is around 1% [12]. The same applies to the distribution of genetic diversity introduced by recombination events (Figure S5). This observation goes against the natural tendency of ClonalFrame which is to identify more readily events between distantly related types [40,41], and therefore indicates that recombination happened predominantly between related strains during the evolution of enterica, with recombination between distinct lineages being rarer.
We attempted to attribute an origin to each recombination event found by ClonalFrame in the five lineages following the method of [41]. Table S3 shows the events for which an origin could be unambiguously attributed, and Figure 4 illustrates the flux of recombination between the five lineages as well as the events coming from other origins within enterica. In lineages 1, 3 and 5, the majority of events was found to come from within these lineages even if ClonalFrame is predisposed to underestimate the propensity of such events [40]. In lineages 2 and 4 however, the primary source of recombination events was "External", i.e. not contained within one of the five lineages (Figure 4). The origin of these events was not attributed to any isolate or group of isolates in particular, but seemed to come fairly uniformly from all parts of enterica minus the five lineages.
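The comparison made here relies on the average pairwise distance between isolates, i.e. the proportion of compared sites at which two sequences differ. A minimal sketch of that calculation is given below, assuming aligned sequences with uncalled positions coded as 'N'; the toy sequences are hypothetical.

```python
from itertools import combinations


def pairwise_distance(a, b, missing="N"):
    """Proportion of sites called in both sequences at which they differ."""
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == missing or y == missing:
            continue  # skip sites not called in both isolates
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no site is called in both sequences")
    return differences / compared


def mean_pairwise_distance(sequences):
    """Average of pairwise_distance over all unordered pairs of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pairwise_distance(a, b) for a, b in pairs) / len(pairs)


# Toy alignment (hypothetical): three isolates, ten sites.
toy = ["ACGTACGTAC", "ACGTACGTAA", "ACGTNCGTAC"]
mean_distance = mean_pairwise_distance(toy)
```

An imported fragment whose divergence from the recipient is well below this average is more likely to have come from a close relative than from a randomly chosen member of the sample.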
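The attribution rule itself is spelled out in the section "Attribution of origins to the ClonalFrame recombination events": a recombined fragment is assigned to a lineage when it matches an isolate of that lineage with 0 or 1 difference, and is left unresolved when nothing matches or when matches come from several lineages. The sketch below illustrates only that decision rule. It assumes that the candidate isolates already exclude those below the affected branch, that isolates outside the five lineages carry the label "External", and that missing data are absent; all names and sequences are hypothetical.

```python
def attribute_origin(fragment, candidates, lineage_of, max_diff=1):
    """Return the lineage of origin of a recombined fragment, or None.

    fragment:   sequence of the imported fragment.
    candidates: dict mapping isolate name to its homologous sequence
                (isolates below the affected branch must be excluded).
    lineage_of: dict mapping isolate name to a lineage label.
    """
    matching_lineages = set()
    for isolate, sequence in candidates.items():
        differences = sum(x != y for x, y in zip(fragment, sequence))
        if differences <= max_diff:
            matching_lineages.add(lineage_of[isolate])
    if len(matching_lineages) == 1:
        return matching_lineages.pop()
    return None  # no match, or matches from several lineages: unresolved


# Hypothetical example: two matching isolates from the same lineage.
candidates = {"iso1": "ACGTACGT", "iso2": "ACGTACGA", "iso3": "TTTTTTTT"}
lineage_of = {"iso1": 3, "iso2": 3, "iso3": "External"}
origin = attribute_origin("ACGTACGT", candidates, lineage_of)
```

Here the fragment matches `iso1` exactly and `iso2` with one difference, both from lineage 3, so the origin is attributed to lineage 3.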

Delineation of enterica
We have sequenced approximately one tenth of the core genome from 114 isolates of enterica from global sources in order to study its population structure. We identified five clear lineages, defined as groups of isolates having the same majority of ancestry in the Structure analysis and representing a clade in the ClonalFrame analysis. It is likely that other similar lineages exist and would be identified using a larger sample of strains. For example, the four strains of serovar Heidelberg (labelled 44, 45, 70 and 81) were closely related to each other (Figure 3) and would probably have been called a lineage in our analysis if our sample had contained one or two more similar isolates, since lineage 5 was reconstructed based on only 5 isolates (Table 1). Our analysis did not include any isolate of serovar Typhi, which has previously been shown based on whole-genome comparisons to be highly monomorphic [19,24,36] and unrelated to other serovars [22,45]. In the context of the enterica data reported here, Typhi would thus constitute a separate and independent lineage, with all current Typhi samples descended from a recent common ancestor on this lineage.
One of the five lineages we identified is particularly ancient, estimated to be two thirds of the age of enterica. In the absence of an internal mutation rate for enterica [46], it is currently not possible to date this age in terms of years. This ancient lineage was designated as "clade B" in a previous study based on MLST [12], which also noted that it might represent the deepest lineage within enterica but that MLST data was insufficient to confirm this hypothesis. Here we provide such data and confirm the existence of this lineage. The identification of this deep lineage is in sharp contrast with a lack of resolution in the deep ancestry of enterica in general (Figure 3). A star-shaped phylogeny had also been reconstructed before based on MLST data [12]. Two non-mutually exclusive hypotheses can be proposed to explain this observation: a loss of information about clonal relationships due to extensive recombination [47], and the fast growth of the effective population size shortly following the birth of the population [48].

Patterns of recombination in enterica
It is now clear that recombination plays a driving role in the evolution of many bacteria [15], including Salmonella [14]. It has been noted that recombination happens more often within the subspecies of Salmonella enterica than between members of separate subspecies [13], but little is known about the details of the recombination process within subspecies enterica. A recent study based on MLST data hinted at an unusually high rate of recombination between the Newport-II and Newport-III groups [11]. However, the number of recombination events detectable with MLST is generally too small to draw hard conclusions about rates of recombination. Here we sequenced a hundred times more data per isolate than MLST, which allowed us to reconstruct many recombination events, thus revealing clear patterns. We found evidence for recombination that varied over at least an order of magnitude across lineages of enterica (Table 1). Different recombination rates for individual lineages of a same species have been found previously between the seroresistant and serosensitive clades of Moraxella catarrhalis [49], between lineages I and II of Listeria monocytogenes [50,51], and between the six hypervirulent lineages of Neisseria meningitidis [27]. It is likely that more examples will be found in future studies as improved methods for detecting recombination are applied to large datasets of whole genomes [52].
Recombination events that occurred between distantly related bacteria are easier to detect than events involving close relatives, because they introduce more polymorphism. ClonalFrame is especially biased against the detection of intra-lineage recombination, because it is based on a model of extra-population recombination [40]. In spite of this, we found that recombination was predominantly between members of a lineage in at least three of the five lineages (Figure 4). At least three hypotheses can be formulated to explain this general pattern. Firstly, certain serovars of enterica are restricted to or associated with specific host species [3] which may result in greater opportunities for recombination between related strains, as previously described in Campylobacter jejuni [53]. For instance, lineage 5 consists of isolates of Paratyphi A and Sendai which are restricted to infecting humans [20,22]. However, lineage 1 contains serovars Choleraesuis, Paratyphi C and Typhisuis which share the same antigenic formula but are differentially adapted to infecting swine, humans and swine, respectively [54]. The other three lineages contain isolates from serovars that are usually described as ubiquitous [3]. Secondly, imports from a distant source might reduce the fitness of the recipients and therefore be removed by selection. Thirdly, laboratory experiments have shown that in many bacteria the chances of success of an import decrease exponentially with the genetic distance between donor and recipient due to the DNA mismatch repair system [55,56]. This decrease is particularly strong in enterica, with recombination between Typhi and Typhimurium reported to be $10^6$ times less likely than within Typhimurium [57,56]. The predominance of recombination events within lineages could thus reflect a fundamental property of recombination rather than ecological structuring or selection.

Speciation in enterica
The genus Salmonella is now generally accepted to contain two species, S. bongori and S. enterica, the latter of which consists of six subspecies including subspecies enterica which is the subject of the present study [58,59]. Many previously named species that had been defined on the basis of phenotypic differences were regrouped into the single species S. enterica on the basis of DNA hybridization results [60].
The difficulty in defining bacterial species stems from our lack of understanding of the processes involved in their formation [61]. Recombination plays a cohesive role in bacteria, so that lineages can evolve into separate species only if recombination is rare between members of distinct lineages [56,62]. Computer simulations have shown that reduced recombination between lineages can lead to patterns of genetic diversity that are similar to those observed in nature [12,63]. Our reconstruction of recombination flux within and between the five lineages of enterica (Figure 4) strongly supports the existence of barriers to recombination between members of separate lineages. It is therefore possible that the five lineages we identified in enterica represent incipient species which have already diverged too far from each other for recombination to regroup them. Such incipient species have the potential to eventually become separate species unless an important shift in gene flow occurs, like the one that was recently reported between Campylobacter jejuni and Campylobacter coli [64].
Many biological models of bacterial speciation have been proposed in the literature, and it is interesting although speculative to ask ourselves which ones apply to the diversification pattern we described in enterica. Under a strict host-association, speciation would be expected to happen through the periodic selection model where adaptation to a host progressively drives between-lineages divergence whilst constraining the genetic diversity of each lineage [65,66]. This model might apply to lineage 5 which contains serovars restricted to humans, but is unlikely to apply to the other four lineages which can be found in a range of hosts. Alternatively, speciation in enterica could be driven by co-evolution with certain bacteriophages which have been shown to infect some serovars more readily than others [67]. Under the geographic mosaic model [68,69], such uneven adaptive pressures can increase the rate of divergence between populations, and this effect was demonstrated in laboratory experiments on Pseudomonas fluorescens [70]. Future research aimed at testing the geographic mosaic theory will need to investigate whether the underlying process is relevant to the evolution of enterica [71].

Comparing Structure and ClonalFrame
The results we have described were obtained using two popular analytical tools: Structure [38] and ClonalFrame [40], which are based on very different evolutionary models. Structure assumes that each individual in the sample is a mixture from a number of unrelated ancestral populations. ClonalFrame assumes that the individuals are related via a phylogenetic framework, but that clonal relationships are occasionally obscured by recombination events. Clearly the Structure model makes more sense for highly recombinogenic species (for example H. pylori; [72]) and the ClonalFrame model for mostly clonal bacteria (for example Yersinia pestis; [73]). However, for many species including Salmonella enterica, recombination occurs but is not sufficiently frequent to completely erase all clonal relationships. Species with such intermediate population structure are eminently suitable for analysis by both models.
We have demonstrated that a combined approach using both methods can aid interpretations of population structure and ancestry. In order to study genetic flux, we needed to first define lineages on the ClonalFrame phylogeny (Figure 3), and Structure allowed us to determine which clades represent meaningful populations. Conversely, the clustering by Structure (Figure 2) could easily have been misinterpreted in the absence of the phylogenetic information provided by ClonalFrame. Structure suggested the existence of a sixth population which seemed to be both a frequent donor and recipient of recombination events (Figure 2). This sixth population is in fact a random mixture of all "other" strains that did not fall into one of the five true lineages (Figure 3). We therefore interpret it as an artifact rather than a true evolutionary lineage. In interpreting the levels of mixed ancestry of these five lineages it is also important to note their different relative ages (Figure 3; Table 1). Older lineages will have had more opportunities for recombination than recent ones, resulting in greater admixture in some lineages than in others. Once the outputs of the two methods were interpreted correctly in the light of each other, it became clear that they were in good agreement and allowed a more detailed and trustworthy analysis than each approach would have allowed on its own.

Bacterial isolates
We analysed a total of 114 previously described isolates of enterica including nine from the Salmonella reference collection A (SARA; [74]), and 63 of the 72 strains in the Salmonella reference collection B (SARB; [75]). The isolates were chosen to span the global diversity of enterica as measured by serotyping and MLST. Table S1 contains the full list of the 114 isolates, including their serotype and Sequence Type (ST) in the MLST scheme of [9]. A database of isolates that have been typed using this MLST scheme is accessible at http://mlst.ucc.ie/mlst/dbs/Senterica.

Choice of genomic regions to sequence
The genome of Typhimurium LT2 [17] was aligned using Mauve [76,77] against the following ten publicly available genomes from the Genomes OnLine Database (accessible at http://www.genomesonline.org; [25]): Choleraesuis [21], Dublin (University of Illinois, unpublished), Pullorum (University of Illinois, unpublished), Paratyphi A [20], Paratyphi B (University of Washington, unpublished), Typhi CT18 [18], Enteritidis PT4 [78], Gallinarum [78], Hadar (Sanger Institute, unpublished) and Infantis (Sanger Institute, unpublished). The black circle on Figure 1 shows the proportion of these ten genomes that aligned to various parts of the LT2 genome. We selected 146 regions of length 2000-2500 bp each from the core genome of enterica where at least nine of the ten genomes aligned with LT2. The regions were selected to be distributed evenly around the genome of LT2 (Figure 1), and to include the location of the MLST fragments of the scheme of [9]. This allowed an assessment of the accuracy of the sequencing and a direct comparison with analyses based on MLST data. Table S2 contains the location and gene content of each region.

Resequencing scheme
We designed an Affymetrix CustomSeq Resequencing Array to sequence each of the 114 isolates in Table S1 across the 146 genomic regions listed in Table S2. The reference genome on the microarray was generated by in silico optimisation of the probability of accurately resequencing the 11 genomes above. Briefly, we started with the genome of LT2 as reference, proposed iterative changes accepted only when they decreased the chance of having two differences within 25 bp between the reference and one of the 11 genomes (which might make them more difficult to call), and repeated the process until convergence. Tests performed on an earlier version of our resequencing array showed that such an optimised reference performed better than using the genome of LT2 as reference in terms of both calling and error rates (data not shown). Base calling was performed using the Affymetrix GeneChip Sequence Analysis Software (GSEQ). We excluded the GSEQ calls of differences from the reference sequence which were within 13 bp of each other. Such calls are unreliable because hybridization at the central position of a probe can be affected by additional differences in the flanking 12 bp. Our resequenced data is available from http://www.stats.ox.ac.uk/lab/salmonella.zip.
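The exclusion of closely spaced difference calls can be illustrated with a short filter over the positions at which an isolate differs from the reference. This is a sketch rather than the GSEQ post-processing actually used: it assumes that "within 13 bp" means a distance strictly below 13 bp (i.e. at most 12 bp, matching the 12 bp flank of a probe), and the function name and positions are hypothetical.

```python
def filter_close_calls(positions, min_gap=13):
    """Drop every difference call lying closer than `min_gap` bp to another call.

    positions: coordinates (in bp) of called differences from the reference
               within one resequenced region.
    Returns the sorted positions that have no neighbouring call closer
    than `min_gap` bp on either side.
    """
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        too_close_left = i > 0 and pos - ordered[i - 1] < min_gap
        too_close_right = i + 1 < len(ordered) and ordered[i + 1] - pos < min_gap
        if not (too_close_left or too_close_right):
            kept.append(pos)
    return kept


# Hypothetical calls: pairs separated by 5, 12 and 13 bp.
reliable = filter_close_calls([100, 105, 200, 212, 300, 313])
```

Both members of a close pair are discarded, since either difference may have disturbed hybridization of the probe centred on the other.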

Structure analysis
We used the Bayesian analysis tool Structure version 2.3 [38] to identify the populations present in our data. The linkage model of Structure was used; this explicitly accounts for the correlation between nearby sites that arises in admixed populations [39]. Four independent runs were performed for each value of the number of populations K ranging from 2 to 10. Each run consisted of 100,000 MCMC iterations, of which the first half was discarded as burn-in. Convergence and mixing of the program were found to be acceptable by manual comparison of independent runs with the same value of K. The optimal value was found to be K = 6 by comparing the posterior probabilities of the data given each value of K from 2 to 10 (Figure S1), and identifying the value of K where the posterior probabilities plateau as described in [79]. Applying the method of [80] also resulted in the estimate K = 6 (Figure S2).
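The second criterion, the procedure of Evanno et al. (2005) shown in Figure S2, selects the K with the largest ΔK, where ΔK is the absolute second-order difference of the log-likelihood across successive K divided by its standard deviation across runs at that K. The sketch below computes ΔK from per-K means, as is commonly done in practice; the log-likelihood values are hypothetical and do not come from our runs.

```python
from statistics import mean, stdev


def evanno_delta_k(log_likelihoods):
    """Compute Delta K for every K that has runs at K - 1, K and K + 1.

    log_likelihoods: dict mapping K to a list of ln P(data | K) values,
                     one per independent run (at least two runs per K).
    Returns a dict mapping K to Delta K; K values whose runs have zero
    standard deviation are skipped.
    """
    means = {k: mean(values) for k, values in log_likelihoods.items()}
    delta = {}
    for k in sorted(log_likelihoods):
        if k - 1 not in means or k + 1 not in means:
            continue  # second-order difference undefined at the ends
        spread = stdev(log_likelihoods[k])
        if spread == 0:
            continue
        second_difference = means[k + 1] - 2 * means[k] + means[k - 1]
        delta[k] = abs(second_difference) / spread
    return delta


# Hypothetical log-likelihoods, two runs per K.
runs = {
    4: [-1100.0, -1104.0],
    5: [-1000.0, -1002.0],
    6: [-900.0, -902.0],
    7: [-895.0, -897.0],
    8: [-894.0, -893.0],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

In this toy example the log-likelihood rises steeply up to K = 6 and then flattens, so ΔK peaks at K = 6.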

ClonalFrame analysis
We applied the analysis tool ClonalFrame version 1.2 [40] to our data. ClonalFrame is a Bayesian inference method which jointly reconstructs the clonal relationships between the isolates in a sample, as well as the location of recombination events that have disrupted the clonal signal. Four independent runs of ClonalFrame were performed each consisting of 200,000 MCMC iterations, and the first half was discarded as burn-in. Convergence and mixing of the MCMC were found to be satisfactory by manual comparison of the runs and using the method in [81]. The genealogies estimated by ClonalFrame have branch lengths measured in coalescent units of time, which are equal to the effective population size $N_e$ times the duration of a generation. We multiplied this by the posterior means of the scaled mutation rate $\theta/2 = N_e \mu$ and the scaled recombination rate $\rho/2 = N_e r$ in order to have branch lengths measured in terms of the expected number of mutation and recombination events (where $\mu$ and $r$ are the per-generation rates of mutation and recombination).
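The rescaling of branch lengths can be written out explicitly, reading the scaled rates as θ/2 = N_e μ and ρ/2 = N_e r. The symbols T_b (length of branch b in coalescent units), M_b (number of mutation events on b) and R_b (number of recombination events on b) are notation introduced here for illustration only.

```latex
\[
  \mathrm{E}[M_b] = T_b \,\frac{\theta}{2} = T_b \, N_e \, \mu ,
  \qquad
  \mathrm{E}[R_b] = T_b \,\frac{\rho}{2} = T_b \, N_e \, r .
\]
```

A branch of length T_b coalescent units spans T_b N_e generations, so multiplying by the per-generation rates gives the expected numbers of events on that branch.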

Attribution of origins to the ClonalFrame recombination events
For each branch of the tree reconstructed by ClonalFrame, we extracted the fragments that had a posterior probability of recombination above 0.5 throughout and which reached 0.95 in at least one position. Each such recombined fragment was then compared with the homologous sequence of all isolates other than those below the affected branch as described [41]. If a match was found with 0 or 1 difference, the origin of the recombination was attributed to the lineage to which the matching isolate belongs. If no match was found, or if several isolates from different lineages matched, the origin of the recombined fragment was considered unresolved.

Supporting Information
Table S1 List of isolates. (PDF)
Table S2 List of sequenced regions. (PDF)
Table S3 Recombination flux between and within lineages. (PDF)

Figure 1. The circle represents the Typhimurium LT2 genome [17]. The two circles in red represent the coding regions, with the forward strand on the outside and the reverse strand on the inside. The black circle indicates the proportion of 10 other genomes that aligned to each specific region of LT2, with proximity to the center indicating fewer genomes aligning. The yellow bars represent coverage of our sequencing scheme, and the blue bars coverage of the MLST scheme. This figure was drawn using DNAPlotter [82]. doi:10.1371/journal.pgen.1002191.g001

Figure 2. Result of applying the linkage model of Structure to our data assuming K = 6 populations. Each vertical line represents one of the 114 isolates, ordered on the X axis by the proportion of ancestry from the major ancestral source. The colouring of each vertical line is proportional to the ancestry of each isolate from each of the 6 populations using the following colours: light blue, dark blue, orange, yellow, dark red and cyan representing ancestral populations 1 to 6, respectively. doi:10.1371/journal.pgen.1002191.g002

Figure 4. Recombination flux reconstructed between the five lineages. The numbers next to each edge represent the number of recombination events coming from a given origin into a given lineage. Edges with fewer than 3 events have been omitted. This figure was drawn using GraphViz [84]. doi:10.1371/journal.pgen.1002191.g004

Figure S1 Posterior probability of the number of populations in Structure. (PDF)
Figure S2 Procedure of Evanno et al. (2005) to determine the number of populations in Structure. (PDF)
Figure S3 Result of Structure based on MLST data only. (PDF)
Figure S4 Result of ClonalFrame based on MLST data only. (PDF)
Figure S5 Distribution of genetic diversity introduced by recombination events in ClonalFrame. (PDF)