Using Whole Genome Analysis to Examine Recombination across Diverse Sequence Types of Staphylococcus aureus

Staphylococcus aureus is an important clinical pathogen worldwide and understanding this organism's phylogeny and, in particular, the role of recombination, is important both to understand the overall spread of virulent lineages and to characterize outbreaks. To further elucidate the phylogeny of S. aureus, 35 diverse strains were sequenced using whole genome sequencing. In addition, 29 publicly available whole genome sequences were included to create a single nucleotide polymorphism (SNP)-based phylogenetic tree encompassing 11 distinct lineages. All strains of a particular sequence type fell into the same clade with clear groupings of the major clonal complexes of CC8, CC5, CC30, CC45 and CC1. Using a novel analysis method, we plotted the homoplasy density and SNP density across the whole genome and found evidence of recombination throughout the entire chromosome, but when we examined individual clonal lineages we found very little recombination. However, when we analyzed three branches of multiple lineages, we saw intermediate and differing levels of recombination between them. These data demonstrate that in S. aureus, recombination occurs across major lineages that subsequently expand in a clonal manner. Estimated mutation rates for the CC8 and CC5 lineages were different from each other. While the CC8 lineage rate was similar to previous studies, the CC5 lineage was 100-fold greater. Fifty known virulence genes were screened in all genomes in silico to determine their distribution across major clades. Thirty-three genes were present variably across clades, most of which were not constrained by ancestry, indicating horizontal gene transfer or gene loss.


Introduction
Staphylococcus aureus, a major human pathogen that can cause skin and soft tissue infections as well as fatal disease due to pneumonia, endocarditis and osteomyelitis, continues to be of concern in both hospital and community settings, especially given the high rates of antibiotic resistance [1]. Resistance to beta-lactams, including methicillin, in S. aureus was first detected in 1961 [2] and methicillin-resistant S. aureus (MRSA) continues to cause significant disease and mortality today. In 2005, more deaths occurred from MRSA infections in the US than from AIDS [3]. Understanding the evolution of this pathogen is important, but there have been few analyses employing whole genome sequencing, looking at the overall relationships among different clonal groups of S. aureus.
There have been several whole genome studies in S. aureus focused on a single sequence type. Harris et al. 2010 [4] sequenced 63 isolates of sequence type (ST) 239, a globally disseminated healthcare-associated clone defined by multi-locus sequence typing (MLST). They presented data showing that CGS analysis revealed the global geographic structure for ST239 and demonstrated the possibility of using this technique to track transmission within a single hospital. In another clade-specific S. aureus study [5], next-generation sequencing was used to analyze 89 strains of clonal complex (CC) 398, a predominant livestock-associated S. aureus lineage. The resolution of whole genome SNPs allowed for the determination that CC398 likely has its origins as methicillin-sensitive S. aureus in humans, instead of animals, as was previously thought. A recent study of ST22 [6] allowed for phylogenetic reconstruction of an important European clone, EMRSA-15 and estimations of evolutionary rates and most recent common ancestor using Bayesian analysis.
Additionally, evidence for recombination has been detected in S. aureus, involving mobile genetic elements, homologous recombination as well as large-scale chromosomal replacements and our analysis supports recombination occurrences across the S. aureus genome. Monecke et al. [7] describe a high rate of genetic recombination from a microarray analysis of 3000 clinical and veterinary isolates; however they restricted their conclusions to the mobilome and did not intimate a recombination impact on the core genome. Other whole genome sequence analyses have been typically restricted to single lineages [4,[8][9][10] and have not been able to evaluate the level of recombination in S. aureus as a whole or compare across lineages. Early studies hinted that some recombination has occurred, but suggest that most S. aureus clonal variants have arisen by point mutation [11]. Takuno et al. [12] examined twelve published S. aureus genomes and detected homologous recombination at a relative rate of 0.6 to the mutation rate. However, the impact and possible differing levels of recombination across diverse sequence types in S. aureus remains undefined. A recent analysis across S. aureus lineages in a specieswide study found that widespread homologous recombination exists and mobile elements are associated with the strongest hotspots of recombination [13].
Horizontal transfer of virulence genes can be a source of diversity and differing success among lineages. Virulence genes have been well studied in S. aureus [14,15] and transfer of those genes has been intimated as a factor in the emergence of new strains of MRSA [16]. In addition to phylogenetic reconstruction, whole genome sequencing allows for the analysis of gene content across analyzed genomes [17].
In this study, we used whole genome sequencing of many of the dominant S. aureus lineages from the US and other regions of the world to: 1) infer genomic evolutionary patterns among lineages 2) examine recombination in S. aureus across and within clonal complexes 3) estimate mutation rates and the time to most recent common ancestor using Bayesian analysis and 4) determine the distribution of known virulence genes. We found few differences from MLST, a high level of recombination between, but not within clonal complexes, differing mutation rates between two common lineages and evidence of horizontal gene transfer or gene loss in most of the virulence genes interrogated. This study spotlights the evolution of two clinically important lineages (CC5 and CC8) and adds to our understanding of the underlying patterns of evolution in the species as a whole.

Bacterial Isolates
Whole genome sequencing (WGS) analysis was performed on a convenience set of a total of 35 isolates: thirty S. aureus strains using the Illumina Genome Analyzer IIx (GA IIx ) (Illumina Inc, San Diego, CA) and five S. aureus strains from MLST-derived Clonal Complex 8 (CC8) using the SOLiD system (Life Technologies Corp, Carlsbad, CA). Strains were selected based on diversity of previously performed PFGE typing to represent genetic diversity, although no PFGE was performed for this study. Twenty-four typed strains were acquired from the National Antimicrobial Resistance in Staphylococcus (NARSA) (http://www.narsa.net/content/ home.jsp) repository where they were characterized by PFGE and SCCmec type. Four strains were selected from the ICARE study at Emory University [18] and were previously characterized only by PFGE, and one additional strain was provided by Laboratory Sciences of Arizona and was characterized by PFGE and the DiversiLab system (bioMérieux, Durham, NC) using rep-PCR. These strains included representatives of the majority of the PFGE types previously described in the United States [19]. Statens Serum Institute (SSI) in Denmark, also provided two strains from the dominant European community-acquired (CA) MRSA lineage, CC80. Strains from CC8 and CC5 as determined by MLST were heavily represented. One additional strain, N315, was re-sequenced as a confirmation of the analysis, but data from that strain is not included here. In addition, 29 publicly available whole genome sequences were used in the analysis (S1 Table). The majority (n = 52) of the strains were MRSA, although a few (n = 12) methicillin-sensitive S. aureus (MSSA), vancomycin-intermediate S. aureus (VISA) and vancomycin-resistant S. aureus (VRSA) were included.
Bacterial culture and genomic DNA preparation S. aureus strains were grown on blood agar (Hardy Diagnostics, Santa Maria, CA) for 24 hours. Genomic DNA was extracted using a Qiagen DNeasy Blood and Tissue Kit as per manufacturer's instructions (Qiagen, Valencia, CA), with the addition of lysostaphin (Sigma-Aldrich, St Loius, MO) at 200ug/mL to the enzymatic lysis buffer and an incubation from 1 to 5 hours. Following extraction, quantification was performed on a NanoDrop 8000 (Thermo Scientific, Waltham, MA) with verification by agarose gel electrophoresis. Extracted DNA quantities were normalized to 10 to 15ng/uL in 200 uL, for final yields of 2 to 3 ug before library preparation.

Indexed genomic library preparation and DNA sequencing
For isolates sequenced on the Life Technologies SOLiD platform, mate pair libraries were prepared using the manufacturer's protocol (Life Technologies Corp). Briefly, sheared, methylated chromosomal DNA was prepared, adaptors were ligated, and the fragments were circularized. Digestion with EcoP151 resulted in 25-base pair tags separated by a distance corresponding to the initial library fragment size. After ligation of sequencing adaptors, library fragments were amplified and attached to beads by emulsion PCR, and the 25-base pair tags were sequenced.
For isolates sequenced on the Illumina GA IIx (Illumina Inc) instrument, DNA samples were prepared for multiplexed, paired-end sequencing following the manufacturers protocol. For each isolate, 1-5μg dsDNA in 200μl was sheared and then purified using the QIAquick PCR Purification kit (Qiagen). Enzymatic processing of the DNA followed the guidelines as described in the Illumina protocol, with the exceptions of processing enzymes obtained from New England Biolabs (New England Biolabs, Ipswich, MA) and oligonucleotides and adaptors from Illumina (Illumina Inc). After ligation of the adaptors, the DNA was run on a 2% agarose gel for 2 hours; subsequently a gel slice containing 500-600 bp fragments of each DNA sample was isolated and purified using the QIAquick Gel Extraction kit (Qiagen). Individual libraries were quantified with qPCR on the ABI 7900HT (Life Technologies Corp.) using the Kapa Library Quantification Kit (Kapa Biosystems, Woburn, MA). Based on the individual library concentrations, equimolar pools of twelve S. aureus libraries were prepared at a concentration of at least 1nM. To ensure accurate loading onto the flowcell, the same quantification method was used to quantify the final pools. The pooled libraries were sequenced on the Illumina GA IIx using "Genomic DNA sequencing primer V2" for 36 cycles. A 50 or 100bp read paired end run was used for all isolates. An average total of 1.76 million reads was obtained for each sample.
MLST, spa type and virulence genes from whole genome sequencing Reads from the Illumina GA IIx were aligned against MLST variants for each of seven housekeeping genes using Lasergene's Seqman NGEN version 2.2 software (Lasergene, Madison, WI), producing consensus sequence for each allele. For the SOLiD sequences, reference-based assemblies were performed using Life Technologies' SOLiD system Analysis Pipeline Tool (Corona Lite). De novo assembly was performed for all genomes using the correction algorithm Spectral Alignment Error Correction followed by Velvet [20]. Assembled contigs were mapped to the MLST reference genes using MUMmer [21]. Consensus sequence for each of the genes was entered into the Locus and Allelic Profile Query on MLST.net to produce sequence types (STs). Additionally, MLST sequences from each query genome were concatenated and a maximum parsimony phylogenetic tree was generated using MEGA version 5 [22,23] for comparison to the whole genome SNP tree. Clonal lineage associations were also determined by MLST mapping on www.spaserver.ridom.de or by comparison to similar spa types or spa repeat compositions to MLST analyzed isolates at the Danish National Staphylococcal Reference Laboratory at Statens Serum Institut, with subsequent grouping using eBURST at http://saureus.mlst. net/eburst/.
spa typing of all isolates was performed in silico from de novo assembled contigs using either Velvet 1.1 [20] or CLCbio's Genomics Workbench 5.5 (CLCbio, Aarhus, Denmark). spa types were analyzed using Ridom StaphType (Ridom GmbH, Münster, Germany). Six strains were spa typed using conventional PCR and Sanger sequencing as previously described [24], five to confirm the spa types determined in silico and one from which a spa type could not be determined in silico.
Known S. aureus virulence genes were queried in the whole genome sequence data of all study strains. Panton-Valentine leukocidin S (lukS-PV) and Panton-Valentine leukocidin F (lukF-PV) (accession number AB186917.1) were used as alignment references for the sequencing reads in the Seqman NGEN V. 2.2 software. Presence of the genes was confirmed by visualization of read alignments across the entire 1,918bp region. To further assess the virulome, blast score ratio (BSR) analysis was used as previously described [17,25] to query known virulence genes (S2 Table) and their homologs across all S. aureus strains. Briefly, TBLASTN [26] was used to align the peptide sequence of each virulence factor against all sequenced genomes, producing a query bit score. The score for each genome alignment was divided by the maximum bit score, produced by a self-alignment, to obtain the BSR for each virulence gene. The BSR value can range from 1.0 (100% similarity across 100% of the peptide) to 0 (no significant alignment). The Multi-experiment Viewer [27] was used to visualize the BSR values only across groups for which there were differing scores.

SNP analysis
In order to determine the core genome SNPs, sequences were aligned against FPR3757, a closed USA300 S. aureus reference genome [28] for both the overall tree analysis and the CC8 strains. CC5 strains were aligned similarly against a closed S. aureus ST5 reference genome, N315 [29]. BFAST [30] was used for all alignments. Indels and reads mapping to multiple locations were removed from the final alignments. Each alignment was analyzed for single nucleotide polymorphisms (SNPs) using SolSNP (http://sourceforge.net/projects/solsnp/). Only loci that had a minimum coverage of 10X and the base variant was present in greater than 90% of the calls, were included in the final analysis. Additionally, duplicated regions were identified by a selfcomparison of FPR3757 or N315 using MUMmer version 3.22 [21] and SNPs within these repetitive regions were removed. The 29 publicly available genomes were aligned using MUMmer/Nucmer. Results from SolSNP and the whole genome alignments were merged using a custom script. Importantly, only loci present in all strains were included and a matrix containing the core, orthologous SNPs was generated.

Phylogenetic trees
To obtain phylogenetic trees, the matrix generated as described above was analyzed using maximum parsimony in MEGA version 5 [22,23] and bootstrapped with 100 replicates. For groups with limited genetic variation and limited recombination such as a single species or within a single species, maximum parsimony is ideal for phylogenetic reconstruction [31] and provides for the use of homoplasy metrics that are the best indicators for phylogenetic accuracy in groups with little homoplasy [32] and can be used to identify recombined regions [5]. Additionally, phylogenetic trees were reconstructed using maximum likelihood for confirmation of results. The Hasegawa, Kishino, and Yano (HKY) model of nucleotide substitution [33] was incorporated, as this model had the lowest Bayesian Information Criterion score in a model comparison conducted in MEGA version 5 [23]. Strain relationships were largely robust to the choice of phylogenetic method. The first overall tree constructed included a publicly available whole genome sequence of Staphylococcus epidermidis; RP62A (accession number: NC_002976) as an out-group to determine the root of the tree. All subsequent trees, including the CC8 and CC5 trees, were rooted with the most basal taxa from that subgroup.

SNP and homoplasy density
Recombined regions may contain a higher density of SNPs and the phylogenetic signal from those SNPs will conflict with the signal from clonally inherited regions. To determine the location and frequency of recombination, the SNP density was calculated using 1-kb non-overlapping regions that were taken from the reference genome, FPR3757. Each region was pulled from the SNP matrix generated as described above and parsimony informative (PI) sites were tabulated. To identify homoplastic SNPs, a maximum parsimony tree was inferred with PAUP v4.0b10, and a SNP was determined to be homoplastic if it had a CI value 0.5. All homoplastic SNPs were coordinated with the 1-kb fragments, and the ratio (homoplasy density) of homoplastic SNPs to all PI SNPs was calculated. The number of PI SNPs and the homoplasy density across each 1-kb window was plotted with Circos [34].

Molecular clock analysis
To estimate evolutionary rates and divergence times of the different clonal complexes, we employed a Bayesian molecular clock method as implemented in the BEAST v1.8.0 software package [35]. First, SNPs with a Retention index (RI) value of < 0.5, as calculated by Paup v4a140, were manually removed from the multiple sequence alignment to filter recombinate regions from the data set. Similarly to the phylogenetic reconstruction using maximum likelihood, the HKY model of nucleotide substitution [33] was incorporated again here to describe nucleotide substitution patterns among taxa. Because only variable sites were included in this analysis, we implemented an ascertainment bias correction model, as done in Gray et al, [36]. Path sampling [37] and stepping stone [38] sampling marginal likelihood estimators were employed to determine the best fitting clock and demographic model combinations [39,40] (S3 Table). These methods of statistical model selection indicated that the combination of the uncorrelated lognormal molecular clock and the nonparametric Bayesian skygrid models best fit the data. The relaxed uncorrelated lognormal molecular clock model was used to infer the timescale and mutation rates while allowing for rate variation among lineages [41] with a gamma distribution prior on the mean clock rate (shape = 0.001, scale = 1000) and an exponential prior (mean = 1/3) on the standard deviation as recommended by Faria et al. [42]. Three independent Markov chain Monte Carlo (MCMC) chains were run for 500 million generations each, with parameters and trees drawn from the posterior every 50,000 th step. Log-Combiner [35] was used to merge the samples from each chain, and the first 50% of each chain was discarded as burn-in. Visual trace inspection and calculation of effective sample sizes was conducted using Tracer [43] and confirmed MCMC mixing within and among chains. The posterior mean and 95% confidence intervals have been reported for the evolutionary rate and time to most recent common ancestor estimates.

Results
Using next-generation sequencing technologies, we sequenced and mapped genome-wide core SNPs in 35 diverse strains, plus 29 publically available whole genome sequences of S. aureus. In addition, we used whole genome data to determine MLST and clonal complex assignments, SCCmec type, PVL status and spa type in silico for all strains, both publically available and newly sequenced (Table 1). Strains were chosen to represent much of the diversity seen in the US in S. aureus, but they do not represent the total diversity of the species. The number of reads per genome and statistics on the novel genome assemblies are presented in S4 Table and characteristics on all strains in the study are in S1 Table. All read data from the 35 strains sequenced in this study are deposited at NCBI in the short read archive under the BioProject accession number PRJNA214785.

Phylogenetic trees and Recombination
The maximum parsimony phylogeny presented in Fig 1 was reconstructed using 80,836 SNPs identified in comparison to a S. aureus CC8 reference (FPR3757). Of the total SNPs, 57,236 were parsimony informative. The phylogeny had a consistency index (CI), excluding parsimony uninformative SNPs, of 0.59 which indicates a moderate level of homoplasy. The majority of bootstrap support values (41/52) were greater than 90%, indicating strong confidence in the groupings. The phylogenetic tree was rooted by the CC45 clade, which was determined to be most basal from a tree rooted with a near-neighbor, S. epidermidis (S1 Fig). This original tree included the northern Australian CC75 isolate, MSHR1132 (accession number: FR821777) however; this genome was removed for subsequent trees because of its large patristic distance from the other strains included in this study. MSHR1132 is more than 17,500 SNPs distant from the nearest S. aureus strain and nearly 40,000 SNPs distant from S. epidermidis and clearly represents a unique, basal clade as has previously been reported [7,44]. Strains TW20, T0131 and JKD6008 are closely related to the CC8 clade and share a recent common ancestor. These strains belong to the MLST sequence type ST239, a hybrid known to have resulted from a large-scale recombination event whereby an ST8 strain acquired an approximate 635 kb region from an ST30 donor [45,46], which is consistent with their placement on the overall phylogenetic tree. Although ST239 belongs to CC8 using MLST, we analyzed those strains separately from other CC8 strains, as they are quite distant from the other CC8's (3943 SNPs) and clearly represent a distinct clade. Homoplasy density within closely related taxa can indicate recombination and PI SNP density can highlight regions of recombination from outside groups. When plotted across the entire chromosome in a circular schematic diagram to identify recombined regions from both closely related taxa and outside groups, the homoplasy and PI SNP density patterns of the core genome SNPs (Fig 1A) showed many more recombined regions than when subclades were analyzed in isolation (Fig 1B). In the homoplasy density plot of the diverse CC phylogeny (Fig 1A, inner circle), homoplasy is scattered throughout the genome, indicating multiple recombination events dispersed across the genome over the evolutionary history of the species. The PI SNP density of the whole tree (Fig 1A, outer circle) is dispersed across the whole genome with a few peaks indicating some regions where recombination has occurred. These data support the hypothesis that recombination, both from outside and closely related taxa, is common in the S. aureus genome when examined across diverse groups. Regions that were filtered from the analysis are visible in Fig 1A by a lack of SNP loci and most correspond to a genomic island or phage that is not present in all the strains analyzed.
For the individual clades, very little evidence of recombination is present when analyzed on their own (Fig 1B), as has been noted previously [4,5,8,47]. Few PI SNPs and low homoplasy densities were observed when five single clades (CC8, ST239, CC5, CC59 and CC45) were analyzed. For example, the CC8 only analysis (top circular plot) demonstrated few regions of elevated homoplasy density, which is consistent with the high CI for this group of 0.99. There were 2 regions with elevated PI SNPs and both occurred in the same phage that was not included in the analysis of all taxa. In the CC5 only analysis (third circular plot from top), there is more evidence of recombination than in the CC8 analysis and this corresponds with the CI value of 0.76 calculated for this group. Elevated homplasy density is infrequent, but dispersed across the genome. The PI SNPs identified are clustered in a single region. Data indicating recombination in the three deep branches of our phylogeny (Branch 1, Branch 2 and Branch 3 in Fig 1A) are also plotted across the whole genome. Each branch contains more than one clade and either seven or eight taxa that have similar branch lengths, with the exception of the branch leading to the ST93 strain, JKD6159. These three schematic circular diagrams indicate varying, but relatively high, levels of recombination present in each branch when compared to the individual clades. For example, the circular plot of Branch 1 shows some dispersed homoplasy density (inner circle) where the total homoplastic SNPs/total PI SNPs = 20.2%, compared to Branch 2, which shows much more (36.1%), and Branch 3, which shows very little (0.84%). The density of PI SNPs appears to be less variable across the three branches analyzed, but nonetheless highlights differing regions of high SNP densities. These data indicate varying recombination rates between branches and an intermediate level of recombination when compared to the whole tree and individual clades.
The CC8 clade contained strains previously PFGE-typed as USA300 and USA500 (Fig 2A,  CI = 0.99, 1,717 total SNPs), two clinically significant groups of MRSA in the US. Additionally, this group contained isolates typed as Iberian by PFGE. The taxa included in this group are consistent with previous characterizations of these PFGE types [48]. The USA300 MRSA isolates that were included in this clade are grouped in a single, tightly clustered sub-clade and all contained the PVL locus and SCCmec type IV ( Table 1). The CC5 clade included strains with the PFGE types USA100 and USA800 (Fig 2B). This tree had a CI of 0.76 and was based on 1948 SNPs of which 632 were parsimony informative. Most of the USA100 isolates clustered on a single branch which may provide a good target for assay development for this group, previously defined by PFGE and representing the dominant strain associated with nosocomial infections in the US [49].

Molecular clock analysis
Bayesian estimation of divergence times and nucleotide substitution rates on the S. aureus full data set, using a relaxed molecular clock, revealed that the time to the most recent common ancestor (TMRCA) was 16,673 (95% CI: 4484-35976) years with an estimated mean nucleotide substitution rate of 1.8 x 10 −5 substitutions per nucleotide site per year (95% CI: 4.8 x 10 −6 -3.7 x 10 −5 ). Other estimates for S. aureus, based on single lineages, have been estimated at 10 −6 substitutions per nucleotide site per year [4,6,8], which falls within the distribution of our estimated mean rates determined from multiple lineages indicating similar rates are estimated from multiple and single lineages. Further, the same Bayesian analysis was applied to each of the two clades of strains, CC8 and CC5. For the CC8 clade, a similar nucleotide substitution rate to the overall data set, of 3.8 x 10 −5 (95%CI: 1.8 x 10 −5 -8.9 x 10 −5 ) was estimated. However for CC5, a value of 1.8 x 10 −3 (95%CI: 1.1 x 10 −7 -4.5 x 10 −3 ) substitutions per nucleotide site per year was estimated. Recent analysis of sterile site infection S. aureus isolates collected over time from the same patient, revealed a significantly higher microevolutionary rate in ST5 compared to ST8 strains [50]. For the CC5 group, a TMRCA of 190 (95% CI: 59-373) years versus an estimate of 2385 (95% CI: 588-4847) years for CC8 was determined, indicating that CC5 is a more recently emerged group. Additionally, the ST239 clade within CC8 was determined to have a TMRCA of 160 (95% CI: 34-332) years, where the 95% CI overlaps previous timing estimates for this clade [13,36]. spa type and PVL from whole genome sequencing All isolates were spa typed in silico from the WGS data, which showed similar results to previous published spa types of the epidemic lineages [7]. Identical spa types did fall within a single clade, however each clade contained more than a single spa type, (Table 1, isolates are listed in the same order as the phylogeny in Fig 1. We performed traditional PCR and Sanger methods on five strains to confirm spa types found in silico and found that four matched between traditional and WGS methods. In the fifth strain, a single repeat of the same motif was missed where WGS determined the spa type as t004 (09-02-16-13-13-17-34-16-34) and the traditional method determined spa as t266 (09-02-16-13-13-13-17-34-16-34). However, both spa types are associated with the same clonal complex, CC45.
The PVL toxin genes lukS/F-PV were not only found in the USA300 isolates mentioned previously, other genomes also contained these genes that have been suspected as indicators of community acquired (CA) strains [51][52][53]. The two CC80 strains from Denmark also contained PVL, as well as two of the three strains from CC1 (USA400) and four additional strains, all within different clades.
Virulence Genes Screen. In addition to PVL, the major groups of known virulence genes were screened across all strains and results of only those genes that varied across clonal complexes are presented in Fig 3. The majority of the genes are chromosomal with a few exceptions (S2 Table). Strains are listed in the order that they grouped on the phylogenetic tree with the clonal complexes indicated. All raw BSR values for each marker screened in this study, including invariant genes, are listed in S5 Table. The Staphylococcal enterotoxin genes exhibited distinct patterns, especially within CC8 that contains many strains of the CA-MRSA lineages. All USA300 strains sequenced contained the sek and seq genes, as well as an ortholog of sei, but not other enterotoxin genes, indicated by red box #1 in Fig 3. The majority of the USA500 strains collected in the last two decades, were positive for sea, seb, see, sek and seq. A different compliment of enterotoxin genes are present in the CC5 strains, including seg, sei, sem, sen, and seo. The USA100 strains, a subclade within the CC5 clade (Fig 3, red box 2), are the only group from which all members carry the plasmidborne sej gene. This suggests that the presence of the sej may be important for the adaptive radiation of this predominantly hospital-associated (HA) lineage. CC45, which contains all of the USA600 strains, also has a unique set of enterotoxin genes, some of which are also present in the nearest lineage of CC30.
All genomes screened contained only one set of capsular polysaccharide genes that define either the CP5 or CP8 serotypes (cap5H-J and cap8H-K, respectively). Each clade contained only one serotype, either CP5 or CP8. The hemolysin genes were present in all genomes screened as were many of the leukocidins, other than PVL (S5 Table and Table 1). The fibrinogen-binding proteins screened were present in many of the isolates with efb being present in all. Nearly all three of the Ser-Asp rich fibrinogen-binding proteins, sdrC, D and E, were partially present in almost all genomes, which has not been shown previously. However, it is possible that these sdr genes are truncated in the assemblies due to repeat regions shared among the three genes, so the results could be due to bioinformatics analysis of short read data and not represent a true biological phenomenon.

Discussion
The current study places well characterized clinical S. aureus strains, including many dominant US clonal groups, in relation to other known lineages based on WGS analysis. This high-resolution phylogenetic approach demonstrates the relationships among and between clonal groups of S. aureus. The mutational and recombinogenic differences among members within and among clades provide insights into the mutational patterns that have shaped the evolution of not only two clinically significant clades (CC8 and CC5), but also the entire species.
When comparing recombination across three levels of analysis (single clades, branches including more than one clade and the whole tree including representatives from fourteen different STs), we found recombination levels varied. In a novel analysis, we utilized homoplasy and PI SNP density as indicators of recombination within closely related taxa and from outside groups and showed clear recombination occurring across clonal lineages and clonal expansion within a single lineage. As expected, we demonstrated evidence of recombination dispersed throughout the genome with higher levels of recombination occurring across clonal groups than within. However, when analyzing the deep branches of our phylogenetic tree, we found that recombination was different for each of the three branches examined. Branch two had a homoplasy density similar to that of the overall tree indicating a very high level of recombination for this group, but differing from the other two branches analyzed. These varying levels of recombination deep in the tree show that recombination may be playing a different role in the evolution of each group and that broad generalizations to the species as a whole may not apply. Additionally, selection may be playing a role in the differing levels of recombination as some groups may be under more intense selective drug pressure than others. A recent study looking across the core genome of the S. aureus genome found that homoplasy rates varied on both a broad and fine scale as well [13].
The rates of evolution for all the taxa in this study were greater than rates estimated by other studies [4,6,8], however, the 95% CI overlaps previous estimates. The taxa here are diverse across clonal complexes and while there is a demonstrated level of homoplasy due to genomic region swapping across lineages, our analysis filtered those regions from the data set so should not be confounded by recombination. Other studies that estimate substitution rates for S. aureus included only strains of a single ST, which did not have the same levels of recombination as demonstrated in this study. However, the rate estimates for the individual lineages, CC8 and CC5, were 10 −5 and 10 −3 , substitutions per nucleotide site per year, respectively. Interestingly, a 10-fold difference in rates between these two groups was found in a study that followed individual patients and sampled repeatedly to determine the microevolutionary SNP accumulation rate over time [50]. However, the significance of this rate difference remains unclear. The TMRCA estimates for CC8 and CC5 indicate that CC5 is a more recently emerged group.
We found evidence of the S. aureus genome's flexibility when examining virulence gene distribution. Our results showed differing patterns of virulence gene distribution across separate lineages for some genes and no differences for others, as has been seen in other studies [54,55]. For example, the capsular polysaccharide genes that define the CP5 and CP8 showed distinct lineage specific variation. Previous studies have shown similar distribution of the cap5 and cap8 genes, where each clonal group is dominated by a single capsular type [56,57]. Additionally, the staphylococcal enterotoxin genes, most of which are not on mobile genetic elements, support groupings based on SNP phylogenies within the CC8 clade. There is a distinct set of the staphylococcal enterotoxin genes absent in the sub-clade that contains the USA300 strains, with only sek and seq gene products fully present. However, the USA500 group within the same clade had three additional staphylococcal enterotoxin genes present, indicating a loss of some virulence factors in the more derived, but highly successful USA300 group. A similar pattern of staphylococcal enterotoxin genes was noted in an MLST and gene analysis of CC8 strains [48]. Another successful clone, ST398, that is the predominant clone in pigs, shows a similar lack of many virulence genes in our analysis. This could be due to a potential adaptation to non-human hosts associated with the loss of human-related virulence factors shed in the process of adaptation [5]. This flexibility of the S. aureus genome allows for adaptive radiation of successful lineages, which appears to be the hallmark of this organism. Mobile genetic elements that carry many of the virulence factors in S. aureus are often lineage associated [58][59][60] An examination of our phylogeny of diverse S. aureus strains highlights some important insights about the relatedness of well-know lineages. The CC8 clade shows a tightly clustered group of USA300 isolates within the CC8 clade containing few SNPs (221 SNPs; 9 parsimony informative SNPs), suggesting a short evolutionary history of this strain that dominates in the US [61][62][63]. The USA600 (ST45) isolates are distant from all other clades in our analysis; its basal location on the rooted tree signifies an early divergence from other S. aureus. Our data demonstrates the genetic uniqueness of this group and that its evolution may be distinct from other S. aureus clades, which could contribute to the apparent increased virulence noted in some strains of this clade [64].
The USA100 isolates, while not a distinct clade, did group separately from the other strains of CC5, notably the USA800s. USA100 has been historically known as a dominant HA-MRSA strain in the USA and is still thought to be the most common strain of nasal MRSA isolates [19,62]. These genomes also carry a distinct group of virulence genes, as has been reported for HA strains [16,56,57], and may indicate separate evolutionary pressures from community-associated (CA) MRSA, even within the same clade.
While ST72 was originally thought to belong to CC8, recent evidence based on MLST data as well as 170 additional distinct genes suggest that this group is the product of recombination between CC5 and CC8 [7]. Our analysis lends further evidence to this hypothesis, given the location of two ST72 (USA700) strains on the phylogeny between the CC8 and CC5 clades.
Comparisons between US strains to those originating in other countries reveal some interesting insights into S. aureus relatedness across large geographic distances. The two strains that originated in Denmark were previously typed as CC80-MRSA-IV-the most prevalent European MRSA clone [65][66][67][68] and grouped close to the CC5 and CC8 clades, indicating that these successful MRSA clades in Europe share genetic history with the successful clades in the US. Future studies will be necessary to identify genetic similarities that lead to successful clone establishment. The ST93 strain, JKD6159, which is the predominant CA-MRSA clone in Australia [69], lies at a significant genetic distance from other S. aureus lineages in our phylogentic tree. The genome-wide SNP phylogeny supports the previous finding from a comparison of coding sequences showing the highly successful ST93 clone is divergent from other previously sequenced genomes of S. aureus, particularly other CA strains [70,71]. Further, the three lineages containing a majority of CA strains, CC30, CC8 and CC80 are distant on the phylogeny, suggesting they don't share a common ancestor or traits and that these groups have evolved independently, multiple times.

Limitations
Despite the significant conclusions we can draw from this phylogenetic analysis of S. aureus strains across diverse types, a sample selection bias likely exists. Our strain selection relied largely on publically available strains representing the PFGE types predominant in the US; however MSSA strains are underrepresented resulting in a MRSA-dominated tree. The results indicate that there is extensive microevolution in the major clades in the US. This is also likely the case in other parts of the world. However, the sample size is less than optimal to make conclusions regarding non-US strains.