Characterization of genetic diversity and population structure within Staphylococcus chromogenes by multilocus sequence typing

Staphylococcus chromogenes is a common skin commensal in cattle and has been identified as a frequent cause of bovine mastitis and intramammary infections. We have developed a seven locus Multilocus Sequence Typing (MLST) scheme for typing S. chromogenes. Sequence-based typing systems, such as MLST, have application in studies of genetic diversity, population structure, and epidemiology, including studies of strain variation as a factor in pathogenicity or host adaptation. The S. chromogenes scheme was tested on 120 isolates collected from three geographic locations, Vermont and Washington State in the United States and Belgium. A total of 46 sequence types (STs) were identified with most of the STs being location specific. The utility of the typing scheme is indicated by a discrimination power of 95.6% for all isolates and greater than 90% for isolates from each of the three locations. Phylogenetic analysis placed 39 of the 46 STs into single core group consistent with a common genetic lineage; the STs in this group differ by less than 0.5% at the nucleotide sequence level. Most of the diversification in this lineage group can be attributed to mutation; recombination plays a limited role. This lineage group includes two clusters of single nucleotide variants in starburst configurations indicative of recent clonal expansion; nearly 50% of the isolates sampled in this study are in these two clusters. The remaining seven STs were set apart from the core group by having alleles with highly variable sequences at one or more loci. Recombination had a higher impact than mutation in the diversification of these outlier STs. Alleles with hypervariable sequences were detected at five of the seven loci used in the MLST scheme; the average sequence distances between the hypervariable alleles and the common core alleles ranged from 12 to 34 nucleotides. The extent of these sequence differences suggests the hypervariable alleles may be remnants of an ancestral genotype.

The extent of these sequence differences suggests the hypervariable alleles may be remnants of an ancestral genotype. 120 isolates collected from dairy cattle (Bos taurus) in three geographic locations, Vermont and Washington State in the United States and Belgium. This sample population allows assessment of both genetic and geographic diversity present in this species. The scheme has been designed such that the seven loci are well separated around the ca. 2.34 Mb genome of S. chromogenes to maximize the opportunity to evaluate the extent to which recombination may play a role in shaping diversity at the population level.

Bacterial strains and DNA isolation
A convenience sample of 120 isolates, from collections of three laboratories, was investigated in this study; these isolates originated from dairy cattle in Vermont (n = 46) and Washington (n = 24) in the USA and from Belgium (n = 48), and pigs in Vermont (n = 2), (S1 Table). The isolates from Belgium were collected in 2001 and 2012 from multiple farms from dairy cattle teat apex swabs (n = 20) and individual mammary quarter milk samples from apparent healthy quarters (n = 28) [14,17,19]. The Washington isolates were collected in 2010 from quarter milk samples of dairy cows with intramammary infections on 6 farms in Washington and Idaho [32]. The Vermont isolates were collected in 1998 from 3 dairy farms and in 2013 from 2 dairy farms from either quarter milk samples of cows with intramammary infections (n = 31), cow teat orifice swabs (n = 1), cow hock skin swabs (n = 3), and bulk tank milk (n = 11); two isolates originated from pig nasal swabs were collected in 2013 from one of the Vermont farms. This study was carried out in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health. The protocol was approved by the Committee on the Ethics of Animal Experiments of the University of Vermont (Protocol Number: 13-033).
Isolates from Washington were previously identified as S. chromogenes using partial 16S rRNA and rpoB gene sequencing, gap gene PCR-restriction fragment length polymorphism and API STAPH ID 20 biochemical testing (bioMérieux, Inc. Durham, NC) [32]. Isolates from Belgium were previously identified as S. chromogenes using transfer RNA intergenic spacer PCR or 16S rRNA gene fragment sequencing [14,17]. Isolates collected in Vermont in 1998 were previously identified as S. chromogenes using API STAPH ID 20 biochemical testing. Subsequently all 120 isolates were verified as S. chromogenes by sequence analysis of tuf and rpoB gene amplicon fragments with > 97% sequence identity [33,34]. The draft genome sequence of S. chromogenes strain MU970 was downloaded from the NCBI microbial genome database (GenBank accession JMJF01000000) to provide a genome reference sequence [12]. All isolates were shared between the University of California Berkeley (UCB) and University of Vermont (UVM) laboratories, and DNA extraction, amplification and analysis procedures were replicated in both labs. Strain MU970 (gift from J. Middleton, University of Missouri) was cultured for DNA extraction and sequence amplification in the UVM lab.
Each laboratory used a different pre-existing protocol for primary growth and preparation of isolates for DNA extraction. In the University of California Berkeley (UCB) lab, isolates were grown aerobically overnight in tryptic soy broth (TSB, BBL) at 37˚C without shaking and then plated on tryptic soy agar plates with 5% sheep blood (TSA, BBL) and incubated aerobically at 37˚C. Single colonies were passed from each TSA plate and grown overnight at 37˚C in 1 mL of TSB. To achieve mid-phase growth, 100μL of overnight growth was combined with 5mL TSB and incubated at 37˚C with shaking for 4 hours. Cell pellets were collected from 3 mL of mid-phase growth by centrifugation at 10,000 rpm for 5 minutes. Alternatively, in parallel, at the University of Vermont lab (UVM), a pure primary culture was grown aerobically for 48hrs on TSA, and single colonies from this growth were inoculated to 5 ml TSB, grown aerobically overnight at 37˚C and cell pellets were collected by centrifugation directly from 1.8 ml of overnight TSB culture. The cell pellets were then frozen at -20˚C until DNA extraction could be performed (UCB) or held at 4˚C and processed within 48 hours (UVM).
DNA extraction was performed using a DNeasy Blood & Tissue kit (Qiagen, Valencia, CA, USA) according to manufacturer's instructions with the modification that the initial lysis buffer was supplemented with lysostaphin (22 U/ml; Sigma-Aldrich). DNA yield and quality was assessed by electrophoresis using a 0.75% agarose gel containing the DNA stain GelStar (Lonza, Rockland, ME, USA) in 1X Tris Borate EDTA (TBE) buffer. Aliquots of DNA were stored at -20˚C.

Selection of target loci for MLST
Fifteen loci were assessed initially as potential candidates for the S. chromogenes MLST scheme. Nine loci originated from MLST schemes used for S. aureus, S. epidermidis, and S. saprophyticus [http://pubmlst.org and unpublished] and six additional loci were selected from the Gen-Bank annotation listing for the S. chromogenes MU970 draft genome. PCR primers for each of the candidate loci were designed using Primer-BLAST to yield sequence segments 700-850 bp in length. The candidate loci were evaluated for sequence variation using a panel of 27 isolates from Vermont and those exhibiting the greatest sequence diversity were selected for further evaluation. As a final filter, we sought to assess whether the loci were relatively evenly distributed around the genome to maximize the potential for diversity resulting from inter-locus recombination. This was determined initially by locating the positions of the 7 candidate loci on the complete genome sequence of the closely related species, S. hyicus [GenBank accession CP008747.1]; subsequent localization on the recently reported complete genome sequence of S. chromogenes strain 1401 [NZ_CP04602.1] established a minimum inter-locus distance of 240Kb with an average distance of 335Kb. Details for the seven loci selected for the MLST scheme are provided in S2 Table. Target gene amplification and nucleotide sequencing Target genes were amplified using the polymerase chain reaction employing a master mix containing 16.75μL DNase-free H2O, 2.5μL PCR buffer, 1.5μL 50mM MgCl2, 0.5μL 10mM dNTP, and 0.25μL 0.5 U/μL Taq polymerase (Invitrogen) per reaction. Master mix was aliquoted into tubes for each locus being amplified and 0.25μL each of 25μM forward and reverse primer was added for each reaction. Three microliters of DNA was added to 22 μL of the master mix and primers. PCR cycling included heating to 95˚C for 7 minutes, followed by 35 cycles of 94˚C for 45 seconds, 55˚C for 45 seconds and 2 minutes of 72˚C. On the last cycle, the samples were heated to 72˚C for 5 min. The samples were then held at 4˚C until further analysis could be completed.
Amplification products (3μL) were evaluated by electrophoresis at 150V for approximately one hour on 1.5% agarose gel containing the DNA stain GelStar. PCR products exhibiting a single band at the predicted amplicon size were processed for sequence analysis by the UC Berkeley sequencing facility. PCR amplification and target gene sequencing were replicated at UVM and the replicate sets of amplicons were processed for sequence analysis by the University of Vermont DNA sequencing facility.

Analysis of MLST sequence data
Raw sequence data containing both forward and reverse reads were recorded in FASTA format for analysis. Sequences were aligned using the MUSCLE function in MEGA 6.06 [35]. After alignment, single strand overhangs and any ambiguous reads were trimmed from the ends of each sequence. Any missing or ambiguous nucleotides were resolved by reviewing the trace data using the FinchTV 1.4.0. viewer. Once a consensus sequence was determined for each sample, all sequences for a single locus were combined into a single FASTA file. The sequences were trimmed again to obtain a standard length. Sites exhibiting single nucleotide polymorphism (SNP) were identified in MEGA. For each of the seven loci, the gene sequence present in the MU970 reference sequence was arbitrarily defined as allele 1 and new alleles were identified by pairwise comparison of SNP sites; each new allele was assigned a number. Sequence types (STs) were defined by unique allelic profiles at the seven loci. MEGA and DnaSP 5.10 were used for assessment of population genetics parameters such as nucleotide diversity and allele (haplotype) diversity [35,36]; the haplotype diversity is a measure of the discrimination power of the typing system [37]. DnaSP was also used to concatenate the sequences of the seven loci for each ST.
The 7-locus concatenated nucleotide sequence data were used for the construction of phylogenetic trees generated by the neighbor joining (NJ) algorithm in MEGA with 1000 bootstrap replications. Clonal clusters of sequence types were identified at the level of single locus variants (SLVs) and double locus variants (DLVs) using the web program eBurst3 as implemented in goeBURST [38,39]. Evidence for recombination was assessed by surveying allelic sequences at each locus within clonal subgroups delineated by eBURST and phylogenetic analyses: alleles at a locus within a clonal subgroup differing at a single nucleotide site were scored as mutations whereas alleles differing at multiple nucleotide sites and alleles shared between different clonal subgroups were scored as recombination events [40]. Recombination between loci was assessed using the four-gamete test [41] as implemented in DnaSP; this test detects the minimum number of recombination events (RM) in the history of the sample. For this test, the concatenated sequences were constructed with the first locus (arcC) sequences appended to the end of the 7-locus concatenation to detect possible recombination between the last and first locus in the circular genome. The pairwise homoplasy index (PHI) for recombination was measured using the program implemented in SplitsTree [42,43]; recombination within and between sequences was considered positive if the PHI test yielded a p � 0.05.

Genetic diversity in S. chromogenes
The MLST scheme for S. chromogenes is based on characterization of nucleotide sequence variation in fragments of seven housekeeping genes in 120 S. chromogenes isolates. Overall, 216 nucleotide substitutions at 213 sites were identified in the 4563 bp of genome sequence covered by the scheme ( Table 1). The 216 nucleotide substitutions resulted in 57 amino acid replacements, a replacement rate of 26.4%. The number of alleles detected at the seven loci ranged from 9 to 21; the majority of alleles differ in amino acid sequence as well as nucleotide sequence. The average number of nucleotide differences per site at each locus in the sample population of 120 isolates is indicated by the nucleotide diversity (π p ). The allelic diversity (Hd) reflects the probability that any pair of isolates drawn from the sample population will carry different alleles at a locus; it is a measure of the discrimination power of the locus in the typing system [37]. The arcC locus exhibits the greatest nucleotide diversity followed by dnaJ and glpF among the 7 MLST loci; however, glpF is superior to arcC and dnaJ with regard to discrimination power. Sequences of the alleles at the each of the 7 loci are available at the PubMLST database (https://pubmlst.org/schromogenes).
A total of 46 distinct Sequence Types (STs) were identified in the sample population; the 7-locus allelic profiles of the 46 STs are listed along with their geographic origins in Table 2 and at the PubMLST database (https://pubmlst.org/schromogenes). By convention, the allele sequences in the reference strain MU970 were defined as allele 1 with the corresponding 7-locus allelic profile of ST1 for MU970. The average nucleotide diversity for the 46 STs is 0.00507; this is somewhat lower than the values 0.0068, 0.0064, and 0.010 for S. aureus, S. epidermidis, and S. hominis respectively [28] but higher than the 0.0021 value for S. carnosus [31] and much higher than the 0.00035 value for S. haemolyticus [27]. ST1 was the most common sequence type observed in the sample population; it was detected in isolates from all three source locations though primarily (17 out of 18) from the two US locales. Only three other STs were found in multiple locales: ST6 and ST18 in Vermont and Belgium and ST15 in both US locales. The remaining 42 STs were detected in only one of the three locales (Table 2). ST1 plus three additional STs (ST28, ST15, & ST5) account for over 1/3 (n = 44) of the isolates in the sample population. At the other end of the frequency spectrum, 24 STs were found only as single isolates. The remaining 52 isolates are distributed among 18 STs containing 2-4 isolates each. This spectrum of isolate distribution among STs is commonly observed in MLST characterization of other staphylococcal species: typically a relatively small number of STs contain a large proportion of the total isolates whereas many STs, often over half, are represented by single isolates [27][28][29][30][31].
The discrimination power of the 7-locus MLST scheme for strain characterization within the overall population is 95.6% ( Table 1). Calculation of discrimination powers based only on the isolates present in each of the individual geographic populations ranged from 90.2% for the Vermont cohort to 93% for the Belgian cohort. This indicates that each of the three sample populations is genetically diverse despite apparent nearly complete genetic isolation from each other. These values are in the range typically seen with MLST schemes for staphylococcal species.
Previous studies have implicated S. chromogenes as a frequent cause of intramammary infections in cattle on dairy farms and have demonstrated strain diversity can vary within and between farms. However the DNA fragment profiling methods used in these studies to characterize strain diversity, notably PFGE, ALPF, RAPD-PCR, and MLVA, have yielded conflicting estimates of diversity, ranging from a high degree of genetic conservation to considerable genetic heterogeneity [10,11,18,19,22,23,[44][45][46][47][48][49]. Given the stand-alone nature of these studies and the incompatibility of their measures of strain diversity, the results of different studies cannot be directly compared or consolidated. In contrast, the intrinsic portability of MLST sequence data coupled with the high discrimination power of this MLST system provide a uniform approach for wide-ranging investigations of strain diversity within and between locations or over time and for testing associations between strain type and isolate sources, disease states, and host species. Moreover, isolates represented in the ever expanding whole genome sequence databases can be characterized using MLST sequence data extracted in silico from the databases, thus allowing inter-compatibility of datasets from different studies [50]. It is to be noted that the seven gene loci employed in the MLST scheme for S. chromogenes are also present in the three related species: S. hyicus, S. agnetis, and S. felis [3]. A preliminary survey of the genomes of these three species indicate that all seven loci are genetically polymorphic, thus providing the potential for development of MLST schemes based on the same seven loci for each of the three species. With a sequence identity of less than 90% between the genes in the different species, the possibility of confusing the different MLST schemes is effectively eliminated.

Population structure and geographic origins
Characterization of population structure using the eBURST algorithm groups STs according to the number of allele differences at the 7 loci; this approach disregards the extent of sequence difference between alleles. Initial analysis at the single locus variant (SLV) level revealed two clonal clusters, one centered on ST1 with 11 satellite STs and the other centered on ST6 with 7 satellite STs; in addition, there were several ST pairs and triplets. The ST6 cluster included ST28, the second most common ST in the population with nearly four times as many isolates than ST6, prompting the question of whether ST28 might be the founder of the cluster. Investigation at double locus variant (DLV) level showed 33 STs connected in a single network with ST28 at the central node with radiations leading to four secondary nodes centering on ST1, ST6, ST15, and ST38 (Fig 1). The 33 STs in this core network account for 96 of the 120 isolates in the sample population. The 13 STs not included in this network are separated from the network and from each other by sequence differences at 3 or more loci.
The ST1 and ST6 nodes connect to ST28 directly, ST1 as DLV and ST6 as a SLV. In terms of nucleotide distances, ST28 and ST1 differ at 4 SNP sites (3 in glpF and 1 in pta); ST28 and ST6 differ at 7 SNP sites in dnaJ allele 3. The glpF and dnaJ allele differences are more likely the result of recombination events given that the sequence motifs of both alleles are present in STs within and outside the common network. The ST15 and ST38 nodes, in contrast, connect to ST28 via intermediary DLV STs: ST17 and ST40 respectively. Despite this, the two nodes are relatively close to ST28 in nucleotide distance, differing at 6 SNP sites for ST15 and 7 SNP sites for ST38, again likely involving recombination events.
The cluster around ST1 consists entirely of single locus variants (SLVs), each bearing a different single nucleotide substitution. This starburst pattern is indicative of a recent clonal expansion with ST1 as the founder. Of the 11 STs in the ST1 cluster, all but one, ST34, originate from Vermont or Washington farms. The cluster around ST 15 is also primarily associated with isolates from Vermont and Washington farms. Unlike the ST1 cluster, the ST15 cluster consists mostly of DLVs. Despite this, the average nucleotide distance between ST15 and its satellites is 2.2.
The STs in the ST6 cluster and the ST38 cluster are predominantly of Belgian origin. The ST6 cluster consists of SLVs in which all but one of the linkages involves loci differing at a single SNP site. Again, this starburst pattern is indicative of a recent clonal expansion with ST6 as the founder. The cluster around ST38 contains more DLVs than SLVs. One ST in the ST38 cluster, ST44, separates itself from the other STs in the cluster by differing from them at an average of 58 SNP sites.
The geographic partitioning of isolates within the core ST network suggests that S. chromogenes populations are relatively isolated, more so between Belgium and the United States than between Vermont and Washington within the U.S. Two lines of evidence suggest Europe as the historical origin of the strains of S. chromogenes investigated in this study: (a) Belgium is home to more STs in the core network than either of the U.S. source locations, and (b) the central node of the core network, ST28, is of Belgian origin. This hypothesis can be tested by characterizing MLST databases representing more geographically diverse sample populations should they be available in the future.

Phylogenetic analysis distinguishes core and outlier STs
Phylogenetic analysis based on overall nucleotide sequence variation between the 46 STs provides an alternative perspective on the population structure of S. chromogenes (Fig 2). As The large cluster (Fig 2B) includes 32 of the 33 STs in the eBURST core network plus 7 STs differing at 3 or more loci from those in the eBURST core group (STs 13,20,22,35,37,41,& 46). The one member of the eBURST core network that placed outside the phylogenetically defined large group was ST44, previously noted as differing substantially at the sequence level from the other STs in the eBURST network. Within the large cluster, only the STs in the ST Gp1 and ST Gp6 appear as unified clusters with strong bootstrap support; the remaining STs, including the 7 STs noted above, are interspersed on variably supported branches. The mean pairwise nucleotide distance between the 39 STs in the group is 9.6 (range 1-22), a relatively small increase over the mean distance of 8.1 (range 1-16) between the 32 STs in the eBurst core network. This increase in nucleotide distance is accounted for by the additional sequence variation present in the STs varying at three or more loci compared to the STs in the eBURST network which are single or double locus variants. The conjoining of the 32 STs in the eBURST network with the seven additional STs in the phylogenetic analysis is thus consistent with all 39 STs sharing a common genetic lineage that has undergone diversification. This grouping includes 105 of the 120 isolates in the total population set.
The placement of the remaining seven STs as outliers to the common core cluster does not reflect meaningful phylogenetic relationships. Rather it is the consequence of these seven STs carrying hypervariable allelic variants at one or more of the MLST loci. Pairwise comparisons of allele sequences at each of the 7 MLST loci show alleles at five loci partition into two classes, one consisting of alleles typically differing at 4 or fewer nucleotides and a second smaller group differing by 10 or more nucleotides from the first group. The alleles in the first group comprise the allelic composition of the 39 STs in the common core cluster and are designated here as common core alleles; alleles in the second group are designated hypervariable (HV). Table 3 compares the relationship of the two classes of alleles in terms of the average pairwise nucleotide distances within and between the classes. It is clear the distances between the two classes are substantially greater than the within-class distances. Two loci, hutU and menF, lack HV alleles.  To illustrate the effect of a single HV allele in a MLST profile, STs 26 & 39 have HV alleles only at the arcC locus and are clear outliers in the 7-locus phylogeny (Fig 2A) but a phylogeny built on the six loci excluding arcC results in a repositioning of these two STs within the common core cluster. The outliers ST44 and ST23 differ from the common core with HV alleles at two and three loci respectively. The remaining three outlier STs ( STs 18,25,& 43) have HV alleles at the five loci and fall into a well-supported group with average nucleotide distances of 105.5 to 108.6 separating these three from the 39 STs in the common core cluster. Notably, these three STs also differ significantly from each other with an average pairwise nucleotide difference of 30.7 between them. These three STs represent 6 isolates of which 4 originate from Belgium and one each from Vermont and Washington State.

Evidence of recombination in S. chromogenes
The contribution of mutation and recombination to the generation of genetic diversity varies considerably among staphylococcal species. In S. aureus, for example, new alleles are predominately generated by point mutations whereas in contrast, recombination and mutation contribute almost equally to the generation of new alleles in S. haemolyticus and the ratio is about 2:1 favoring recombination in S. epidermidis [26,27,51]. Species that undergo very low rates of recombination have population structures characterized by clonal lineages that diversify slowly by the accumulation of point mutations. At the other end of the spectrum, species that undergo frequent recombination can exhibit a level of genetic diversity that complicates phylogenetic analysis and reconstruction of population structure.
The pairwise homoplasy index (PHI) was used to gain an initial assessment of recombination among the concatenated sequences of the 32 STs in the eBurst network, the 39 STs in the common core, and the 46 STs in the full data set. No statistically significant evidence of recombination was detected for the core 32 STs (p = 0.80), but recombination was indicated for the 39 STs in the common core (p = 0.018) and very strong evidence for recombination was found for the full set of 46 STs (p<0.0001). To characterize the distribution of recombination events within and between loci, the "four gametes" test of Hudson and Kaplan [41] was used; this test yields the minimum number of recombination events between SNP positions in the concatenated ST sequences. Detection of recombination between MLST loci is of particular interest for it indicates expansion of genomic diversity beyond that provided by allele sequence variation. For the 32 ST sequences in the eBURST core complex, four inter-locus and no intralocus recombination events were detected; the inter-locus recombinants were arcC/fumC, fumC/dnaJ, dnaJ/glpF, and glpF/menF. Analysis of the 39 ST sequences in the common core group added one more inter-locus recombination event, menF/arcC, plus an intra-locus recombination event in dnaJ. Analysis of all 46 ST sequences added 10 more within-locus events: 6 in arcC, 2 in fumC, and 2 in glpF for a total of 16 minimum recombination events overall. Analysis of the outlier 7 ST sequences accounted for 11 of these events, the five between loci and six within loci. These findings are consistent in showing that recombination contributes to genetic diversification in S. chromogenes, particularly in the STs with HV alleles.
To assess the relative contributions of mutation and recombination events at the allele level, allelic sequence changes were surveyed at each locus within the nodal subgroups delineated by eBURST. Alleles differing at a single nucleotide site were scored as mutations whereas alleles differing at multiple nucleotide sites and alleles shared between different clonal subgroups were scored as recombination events [40]. Notably, the defining allelic signature of three of the four nodal subgroups can be attributed to recombination events contributing one or more new alleles to the allelic profile of ST28, the central node. The nodes of the ST1 and ST6 nodal subgroups differ from ST28 by recombined alleles at the glpF and dnaJ loci respectively. In contrast, the allelic differences between the STs within each nodal cluster are single nucleotide substitutions in keeping with the starburst topologies of these two nodal clusters. The ST15 nodal subgroup differs from ST28 with recombinant alleles at both the arcC and glpF loci; single nucleotide variants account for the remainder of the variation within this nodal cluster. The nodal subgroup around ST38 presents a different picture. Of the 10 allele changes occurring within the six STs in this subgroup, four can be attributed to recombination and the remaining six to mutation; thus both single site substitutions and recombination events contribute to the differences between STs within the subgroup. Overall, this assessment indicates the ratio of recombination to mutation to be about 8:32 in the 32 STs comprising the eBURST clonal network. Notably, there is only one example of allele sharing between STs in different nodal subgroups in the eBURST network: the variant allele glpF-3 is shared between multiple STs in nodal subgroups ST15 and ST38. This allele is also shared with multiple STs outside the eBURST clonal network, validating its status as recombinant.
In contrast to the predominance of mutation over recombination in the eBURST clonal network, recombination events predominate in the seven STs containing HV alleles. Indeed, that these seven STs are comprised of mixtures of common core and HV alleles is indicative of recombination. Comparison of HV allele sequences at each of the five loci with HV alleles provides an estimated recombination to mutation ratio of 13:5. The phylogenetically supported branch containing ST18, ST25, and ST43 (Fig 2A) allows direct comparison at the ST sequence level and yields a recombination to mutation ratio of 8:2. The predominance of recombination to mutation among the HV alleles is indicative of the deeper ancestry of these alleles compared to those of the common core.

Hypervariable alleles-remnants of a relict genotype?
The extreme sequence variation in the HV alleles relative to the common core alleles prompts the question of the origin of these alleles. To test the possibility the HV alleles are introgressions from other species, BLAST searches were done querying representative HV alleles against all genomes in the genus Staphylococcus; no hits above 80% sequence identity were observed for any species other than S. chromogenes. Additionally, both the common core and HV allele sets are equidistant from the corresponding genes in S. hyicus reference sequences, consistent with expectation for common ancestry. An alternative hypothesis is that the HV alleles are remnants of a relict genotype. The large average pairwise SNP distances separating HV and common core alleles is indicative of an early time of divergence between the two classes of alleles (Table 3). The hypothesis that the HV alleles are remnants of a lineage older than the common core alleles is supported by the larger number of variant sites per allele for the HV alleles than for the common core alleles (6 vs. 1.04) and the higher average frequency of synonymous site variants in HV alleles than in the common core alleles (86.7% vs. 49.4%) [52]. Additional support for this hypothesis is the increased incidence of recombination relative to mutation in the HV alleles compared to the common core alleles; sequence variation due to recombination tends to accumulate over time.
The MLST profiles of the 7 outlier STs contain both common core and HV alleles, ranging from one to 5 HV alleles in an MLST profile. These mixed profiles are most likely to have arisen via recombination; mutational variation is not a plausible alternative. It is not possible to ascertain from the MLST data alone whether the mixtures are a result of introgression of non-HV alleles into an HV genome or the other way around. However, the apparent recent origin of the common core alleles and the possible relict origin of the HV alleles suggest the mixtures are relatively recent. A more detailed picture of the population history of S. chromogenes awaits further study using whole genome sequence data.

Conclusions
The MLST scheme described in this paper provides a tool for the differentiation and identification of strains within S. chromogenes. With a power of discrimination between strain types exceeding 90% in geographically localized populations and greater than 95% overall, this MLST scheme has potential for use in epidemiological investigations of pathologies associated with this species and the ecological relationships between microbe and host. The geographic distribution of strain types indicated a high degree of genetic isolation between locales, posing a question of the historical and genetic factors accounting for this separation. Phylogenetic analysis of strain types identified by the scheme showed most to be contained within a single large and genetically diversified lineage which included strains arising from mutation driven clonal expansions and more varied strains generated by recombination events. The MLST analysis also revealed that some strain types were differentiated by having alleles with highly variable sequences at one or more of the loci in the 7-locus MLST scheme; these highly variable alleles were posited to be remnants of a relic genotype of S. chromogenes. These features of the population structure of this species provide a prospectus for future studies.
Supporting information S1