Radical genome remodelling accompanied the emergence of a novel host-restricted bacterial pathogen

The emergence of new pathogens is a major threat to public and veterinary health. Changes in bacterial habitat such as a switch in host or disease tropism are typically accompanied by genetic diversification. Staphylococcus aureus is a multi-host bacterial species associated with human and livestock infections. A microaerophilic subspecies, Staphylococcus aureus subsp. anaerobius, is responsible for Morel’s disease, a lymphadenitis restricted to sheep and goats. However, the evolutionary history of S. aureus subsp. anaerobius and its relatedness to S. aureus are unknown. Population genomic analyses of clinical S. aureus subsp. anaerobius isolates revealed a highly conserved clone that descended from a S. aureus progenitor about 1000 years ago before differentiating into distinct lineages that contain African and European isolates. S. aureus subsp. anaerobius has undergone limited clonal expansion, with a restricted population size, and an evolutionary rate 10-fold slower than S. aureus. The transition to its current restricted ecological niche involved acquisition of a pathogenicity island encoding a ruminant host-specific effector of abscess formation, large chromosomal re-arrangements, and the accumulation of at least 205 pseudogenes, resulting in a highly fastidious metabolism. Importantly, expansion of ~87 insertion sequences (IS) located largely in intergenic regions provided distinct mechanisms for the control of expression of flanking genes, including a novel mechanism associated with IS-mediated anti-anti-sense decoupling of ancestral gene repression. Our findings reveal the remarkable evolutionary trajectory of a host-restricted bacterial pathogen that resulted from extensive remodelling of the S. aureus genome through an array of diverse mechanisms in parallel.


Introduction
Bacteria have a remarkable capacity to adapt to new environmental niches, a characteristic that underpins their potential to become successful pathogens. Some pathogens can infect multiple host-species [1], while others become specialized and restricted to a single host. The evolutionary process of host restriction, observed across distantly-related bacterial groups, typically involves genomic events such as gene loss, gene acquisition, or chromosomal rearrangements [2][3][4][5]. The impact of such events may be most apparent after long-term associations between bacterium and host, with obligate intracellular pathogens the most extreme examples, exhibiting extremely small and compact genomes [6].
Staphylococcus aureus is a highly versatile bacterial pathogen, associated with an array of diseases in humans and livestock representing a threat to public and livestock health [7]. S. aureus has undergone extensive host-switching events during its evolutionary history leading to the emergence of new endemic and epidemic clones in livestock and humans [8]. In particular, the highest number of host-jump events appeared to have been between humans and ruminants in either direction [8]. A combination of gene acquisition, loss of gene function, and allelic diversification have been central to the capacity of S. aureus to undergo successful hostadaptation [8].
A subspecies of S. aureus, Staphylococcus aureus subsp. anaerobius, responsible for Morel's disease [9], is highly restricted to small ruminants and its clinical manifestation is confined to abscesses located at major superficial lymph nodes. Outbreaks of this disease are associated with significant economic losses and have been reported in Europe, Africa and the Middle East [10]. In contrast to S. aureus subsp. aureus, S. aureus subsp. anaerobius is microaerophilic and catalase-negative [11], but their genetic and evolutionary relatedness is poorly understood.
Here, we investigate the evolutionary history of S. aureus subsp. anaerobius, revealing that it descended from an ancestor resembling S. aureus and adapted to its current ecological niche by extensive genome diversification involving massive genome decay, gene acquisition, and genome rearrangements. Furthermore, acquisition and expansion of ISSau8-like insertion sequences associated with attenuation of transcription of flanking genes through multiple mechanisms has led to rewiring of the transcriptome. Overall, our results reveal a remarkable example of a bacterial pathogen that has undergone extensive genome remodelling in transition to a highly niche-restricted ecology.

S. aureus subsp. anaerobius has a highly conserved genome
Previously, it has been reported that S. aureus subsp. anaerobius isolates belong to a single clonal complex as determined by multi-locus sequence typing (MLST) [10,12]. To date only one fragmented draft genome of S. aureus subsp. anaerobius isolated in Sudan has been publicly available, and therefore the genome architecture has been unclear [13]. Here, we applied PacBio sequencing to the type strain MVF7, isolated in Spain in 1981-1982 [11], to provide the first complete genome sequence for a strain of S. aureus subsp. anaerobius. The genome consisted of a circular chromosome of 2,755,024 bp with 2,888 coding DNA sequences (CDS), 56 tRNAs and 5 complete rRNA copies, with a GC content of 32.74% (Fig 1A).
In order to further explore the genome content of S. aureus subsp. anaerobius, we obtained genomic DNA for 39 additional strains and performed Illumina whole genome sequencing to examine genome diversity and population genetic structure. Of these strains, 30 were isolated from different outbreaks in Spain between 1981 and 2012, with the others from Sudan (n = 3), Italy (n = 3), Poland (n = 2) and Denmark (n = 1) [10]. Among the isolates, we identified 4 closely-related novel sequence types (ST) (see Supplementary S1 Text for more detail).
The combined dataset of 41 genomes had an average length of 2,685,366 bp (range: 2,604,446 to 2,758,945 bp) that contained 2,479 genes (92.7%) shared among all isolates (coregenome) of the total 2,675 genes identified (pan-genome). On average, each strain has a highly compact accessory genome of only 130 genes (range 107-143).

S. aureus subsp. anaerobius evolved from an ancestor that resembled S. aureus subsp. aureus over a millennium ago
The evolutionary origin of S. aureus subsp. anaerobius and its phylogenetic relatedness to S. aureus subsp. aureus is unknown. In order to investigate its evolutionary history, we constructed a core genome alignment of S. aureus subsp. anaerobius and 807 S. aureus isolates that were representative of the species diversity [8] and reconstructed a maximum-likelihood phylogeny (Fig 2). S. aureus subsp. anaerobius formed a distinct clade on a long branch within the S. aureus subsp. aureus diversity but was not genetically allied to any particular clonal complex that has been previously assigned. These data indicate that S. aureus subsp. anaerobius emerged from an unknown S. aureus lineage likely after a human to ruminant host-switch event that occurred over a millennium ago.
A core genome sequence alignment of the 41 isolates was produced with 5 recombinant regions (size range: 54 bp to 2,950 bp, see S1 Fig) excluded leaving an alignment of 2,324,010 sites with 3,443 variable sites. Phylogenetic analyses by the maximum likelihood approach revealed two distinct clades among the S. aureus subsp. anaerobius examined representing the isolates sampled in Sudan, and all others (sampled in Europe), respectively (Fig 3). The average number of SNPs between pairs of S. aureus subsp. anaerobius genomes was 494 SNPs (range: 0-1,746), and the number of SNPs between the two clades was 1,697 SNPs (1,663-1,746).
Linear regression analysis provided evidence of a molecular clock-like evolution in the ML phylogeny ( S2 Fig) and time-scaled Bayesian phylogenetic analysis indicated that the model combination that best fitted the data was a strict molecular clock paired with a constant population size, though other models performed similarly (S1 Table). This analysis dated the most Rings show, from outside to inside: annotated genes in the positive strand, those in the negative strand (with the 205 pseudogenes and 87 insertion sequences shown in red and blue, respectively), and the mobile genetic elements found across all isolates (gold: prophage FSaa1; cyan: pathogenicity island SaaPIMVF7). (B) Gene map of SaaPIMVF7. Genes in grey are pseudogenes and genes in orange are intact by comparison other SaPI relatives. int: integrase; pri: primase; scn: Staphylococcal complement inhibitor (SCIN); vwb: von Willebrand factor-binding protein (vWBP). The box shows the expression of the SaaPIMVF7-encoded vwb. The SaaPIMVF7 vwb gene was cloned into the expression vector pCN51 under the control of a cadmium-inducible promoter, transformed into a coagulase and vWbp-deficient derivative of strain RN4220 (RN4220 coa::tetM Δvwb) and the ability of SaaPIMVF7 vWbp to coagulate ruminant plasma was assessed. (C) Graphical summary of the main biological functions potentially disrupted by the presence of pseudogenes.

S. aureus subsp. anaerobius has undergone large intra-chromosomal rearrangements
During the evolutionary genomic analysis, we discovered that S. aureus subsp. anaerobius has undergone six large genomic translocations that ranged in size between 70 kb and 346 kb since separation from its S. aureus subsp. aureus progenitor (Figs 4 and S2). As the edges of each translocated region were flanked by identical insertion sequence elements, we suggest that each event was the result of transposase-mediated homologous recombination between the leading and lagging strands [15]. For 2 pairs of translocations (I, VI, and II, V, respectively) the original genomic locations were exchanged whereas translocations III and IV were the result of excision and subsequent insertion at a distinct genomic location, or by sequential single inversion events that are reciprocal (or near-reciprocal) around the origin or terminus (Fig 4). Because all translocations occurred via an inversion event, genes in the translocated areas conserved their original orientation and the GC skew remained unaffected (Fig 4).
PCR-based analysis of the distribution of the 6 rearrangements identified in strain MVF7 among all S. aureus subsp. anaerobius strains revealed that all 6 were conserved in the majority of European isolates (clade II) whereas the isolates from Sudan (clade I) lacked translocations numbers II and V, suggesting they occurred in the clade I lineage since separation from a common ancestor (S2 Table). Although the impact of the re-arrangements on global gene expression is unclear and it is feasible that these events have become fixed due to absence of any negative selective consequence, large chromosomal rearrangements in S. aureus subsp. aureus have been demonstrated to mediate transition to less virulent strains associated with persistent infection [16,17].

S. aureus subsp. anaerobius exhibits extensive genome decay
Analysis of the complete genome of S. aureus subsp. anaerobius strain MVF7 revealed the existence of 205 pseudogenes, with a range of 201 to 210 pseudogenes per genome across the 41 S. aureus subsp. anaerobius strains examined with similar numbers in each major clade (see Supplementary S1 Text). Pseudogenes originated through point mutations that caused frameshifts, premature stop codons or alternative downstream start codons and were evenly distributed around the genome (Fig 1A). As a comparison, the same pseudogene detection pipeline found an average of 14 (range: 2-30) pseudogenes per genome among the closed S. aureus subsp. aureus genomes (n = 167) in the reference dataset [8]. The high frequency of pseudogenes in S. aureus subsp. anaerobius represented an average of 9.5% (range: 9.2-10.1%) of the total genome length representing a remarkable example of extensive gene loss that is likely to have a major impact on bacterial phenotype. A total of 164 pseudogenes (approximately 80% of those found in any given isolate) were shared among all S. aureus subsp. anaerobius genomes, of which 92.1% were caused by the same mutation in all isolates consistent with early events in the evolutionary history of S. aureus subsp. anaerobius. Of note, 149 (90.8%) of these pseudogenes were part of the S. aureus "core" pangenome, as calculated from the full genomes within the S. aureus reference set [8].
The original functions of the 205 pseudogenes present in MVF7 were classified into clusters of orthologous groups (COGs) of proteins (see S3 Table). Of these, 82 (50%) were associated with metabolic functions, 29 (17.7%) to cellular processes and signalling, and 10 (6.1%) to information storage and processing, and 43 (26.2%) had unknown functions. Enrichment analyses of both KEGG pathways and GO terms (see S4 Table for a full list) revealed a statistically significant enrichment of pseudogenes involved in biosynthesis of amino acids and metabolic pathways (see Supplementary S1 Text and S3 Table). Of note, loss of function of genes mediating resistance to oxidative killing mechanisms such as catalase and other oxidoreductases may underpin the micro-aerophilic growth phenotype. In addition, an array of genes with known roles in pathogenicity including capsule biosynthesis, adherence and secreted proteins mediating interactions with the host, are predicted to be non-functional. Taken together, the extensive loss of gene function impacting on bacterial metabolism and pathogenicity is consistent with the highly fastidious growth requirements and defined disease tropism of S. aureus subsp. anaerobius (Fig 1C).

S. aureus subsp. anaerobius contains a novel pathogenicity island encoding a ruminant specific effector of abscess formation
Our genomic analysis revealed a small accessory genome with limited strain-dependent variation in gene content. However, 2 putative mobile genetic elements were identified in all strains examined including a 43.2 kb, novel prophage (FSaa1) belonging to the Siphoviridae family, and a novel 13 kb S. aureus subsp. anaerobius pathogenicity island (SaaPIMVF7) (Fig 1B). While FSaa1 contained no putative determinants of virulence, SaaPIMVF7 encodes novel variants of known virulence factors Staphylococcal complement inhibitor (SCIN) and the von Willebrand factor-binding protein (vWbp) (Fig 1B) [18,19]. A phylogenetic network of all members of the SaPI family constructed with SplitsTree v4.14.6 and the NeighborNet algorithm [20] (S4 Fig) revealed that SaaPIMVF7 is genetically closest (93.1% nucleotide identity) to SaPIov2, found among isolates of S. aureus subsp. aureus CC133 which has a tropism for small ruminants (Fig 1B). Of note, SaaPIMVF7 contained 10 pseudogenes out of a total of 21 genes identified, including the integrase and primase genes, suggesting that it can no longer be mobilized but is a stable feature of the chromosome. Of note, both the vwb and scn genes were intact and transcribed (S5 Fig), consistent with functionality (Fig 1C). The vWBP has previously been demonstrated to promote coagulation of plasma, and is involved with abscess formation during invasive infection [21,22]. In order to test the hypothesis that SaaPIMVF7-encoded vWBP was functional, we cloned the vwb variant into the expression vector pCN51, and introduced this plasmid into a coagulase and vWbp-deficient derivative of strain RN4220 (RN4220 coa::tetM Δvwb). Importantly, expression of the variant vWBP protein conferred the capacity for coagulation of ruminant plasma (Fig 1B). In summary, all strains of S. aureus subsp. anaerobius contain a novel SaPI indicating an acquisition event that occurred over 1000 years ago prior to divergence of the 2 major subclades. Strikingly, the accumulation of loss-of-function mutations in genes required for mobilization indicate that SaaPIMVF7 is a stable element of the genome and functional expression of vWBP with ruminant coagulation activity suggests a potential role in abscess formation, a defining characteristic of Morel's disease.

Expansion and intergenic location of insertion sequences in the genome of S. aureus subsp. anaerobius
An additional striking feature of S. aureus subsp. anaerobius was the presence of 87 insertion sequences (IS) distributed around the genome in strain MVF7 (Fig 1A). Each IS had 99.3% nucleotide identity with each other, and 97% identity with the previously described ISSau8 from bovine S. aureus strain RF122. The great majority (77 of 87) had premature stop codons that truncated the transposase gene whereas only 10 IS contained full-length intact transposase genes. The distribution of the IS elements identified in MVF7 among the rest of S. aureus subsp. anaerobius isolates was examined by mapping Illumina reads for each strain against the border between each IS and its flanking gene in the closed genome of strain MVF7 to identify reads overlapping this border. This analysis revealed that 68 out of the 87 MVF7 ISs (78.2%) were inserted at an identical location in all 41 S. aureus subsp. anaerobius isolates consistent with an ancient acquisition that occurred during the emergence of S. aureus subsp. anaerobius. An additional 7 IS elements were shared among all the European isolates, but were absent from isolates from Sudan.
Previously, IS elements have been shown to contribute to loss of gene function or phase variation via direct insertion into CDS sequences [23]. However, all except one IS element were inserted in intergenic regions (the single exception was inserted into the msbA CDS), and we speculated that the non-random distribution of IS elements may influence the expression of the chromosomal genes located adjacent to the IS insertion site.

IS affects neighbouring gene regulation in S. aureus subsp. anaerobius by disruption of promoter region and operon structure
Our in silico analysis of promoter and transcriptional terminator (TT) sequences indicated that each IS had a functional promoter but lacked a TT between the tnp gene and the downstream gene, supporting the idea that the tnp transcript could influence expression of the 3'located gene. Furthermore, 2 putative TT were identified at different sites in the IS: TT1 is located in the 5' region of the IS genes, and is bidirectional, while TT2 is located at the 3' end region of the IS and is unidirectional in the antisense orientation (see Fig 5). The putative role for the bidirectional TT1 would be to prevent the interference between transcripts initiated either from the IS or from the gene located upstream to the IS. By contrast, TT2 would prevent the impact that an antisense transcript originated from a gene located downstream of the IS could have on the expression of the IS transcript (Fig 5). This organisation suggests a strategy that allows IS transcripts to interfere with the genes located 3' to the IS, while blocking any interference that transcripts from flanking genes could have on the IS-derived regulatory transcript.
To validate the functionality and directionality of the TTs, we employed plasmid pCN42 [24], in which the blaZ reporter is under the control of the P cad promoter. The conserved 5' and 3' region of the ISs containing the bi-and unidirectional TTs, respectively, were inserted, in both orientations, between the P cad promoter and the blaZ reporter (see Fig 5), and the expression of the reporter gene measured with and without induction of the Pcad promotor. The results confirm the existence, functionality and directionality of both TTs, consistent with the hypothesis that ISs can control the expression of 3'-located chromosomal genes.
It is likely that some of the IS insertions have been fixed in the population due to the effects of genetic drift after a bottleneck. However, it is also possible that intergenic IS insertions are fixed by selection because they beneficially influence the expression of neighbouring genes. In order to test this hypothesis, we examined the impact of IS on the expression of selected downstream genes located in the same orientation as the IS. Since S. aureus subsp. anaerobius is not genetically tractable, we designed reporter constructs reconstructing the genetic organisation found in S. aureus subsp. anaerobius (S6 Fig) and tested gene expression levels in S. aureus subsp. aureus. Specifically, we selected four different genes (rpsP, dnaD, metC and metN) that were identified in S. aureus subsp. anaerobius to contain an IS at distinct distances upstream of the gene start codon, and constructed blaZ transcriptional reporter fusions using pCN41 [24] (Fig 6A-6D). The expression of these genes in the presence or absence of upstream ISs was then tested. Importantly, and in support of the hypothesis, absence of an IS impacted on the expression of the neighbouring genes, although this effect was different depending on the gene including increased transcription of dnaD, metC or metN, or decreased rpsP transcription (Fig 6A-6D). The impact on downstream gene expression likely depends on the position of the IS relative to the gene's start codon and the nature of its integration (conservative or deleterious; see S6 Fig). In some cases, the IS will reduce expression by altering either the promoter or the DNA binding domain of a putative inducer of the flanking gene, while in other cases it may promote enhanced transcription by eliminating the binding site of a repressor. These hypotheses are currently under investigation.

ISs affect S. aureus subsp. anaerobius gene expression by a novel antiantisense RNA interference mechanism
Intriguingly, the majority of ISs (63 of 87) were located in an antisense orientation to putative target genes. In such an orientation, the previously described effects on gene expression would not be feasible, and we reasoned that distinct mechanisms of IS-mediated control of neighbouring gene expression may exist. To test this hypothesis, we selected 2 examples of S. aureus subsp. anaerobius genes that were located in an anti-sense orientation to the flanking IS, including mgrA, encoding a global regulator of virulence [25][26][27], and adhA encoding a zincdependent alcohol dehydrogenase, induced under low oxygen conditions [28] (S4 and S5  Figs). Since the antisense orientation of these genes relative to IS precludes monitoring of gene expression by transcriptional fusions, they were cloned to express 3xFLAG tagged versions of the encoded proteins, which facilitated measurement of expression levels by western immunoblot. To examine the impact of IS on gene expression, expression plasmid constructs with and without IS in the antisense orientation were introduced to S. aureus subsp. aureus strain RN4220 Δspa and expression levels compared (see scheme in S7 Fig). Surprisingly, no differences in the expression levels of MgrA or AdhA were observed in any of the constructs suggesting that the presence of ISs did not actively alter expression levels of the downstream genes by a classical anti-sense mechanism (S7A and S7B Fig, respectively).
However, the analysis of the genetic context of mgrA revealed another possible mechanism of IS-mediated control of gene expression. In the ancestral S. aureus subsp. aureus, the gene downstream of mgrA is in an antisense orientation (S6 Fig). Furthermore, overlapping antisense transcripts for mgrA and its downstream gene in S. aureus have been identified in a previous study [29], suggesting the possibility of antisense interference of mgrA gene expression. Accordingly, we hypothesized that the presence of an IS between the 2 genes in S. aureus subsp. anaerobius could disrupt the putative interference (antisense) mechanism. To examine this possibility, we constructed another set of reporters (see scheme in Fig 6E) in which we used the inducible expression plasmid pCN51 to mimic expression from the gene adjacent to mgrA. Since the gene located downstream of mgrA belongs to a large operon that was impossible to clone into pCN51, we used the cadmium-inducible promoter present in plasmid pCN51 as the origin of the anti-sense mRNA for mgrA, mimicking the natural genetic context (Fig 6E). The different constructs were introduced into the S. aureus subsp. aureus strain RN4220 Δspa and expression from the cadmium-inducible pCN51 promoter induced. Absence of the IS resulted in much reduced MgrA expression levels suggesting that presence of a strong transcriptional terminator protects mgrA mRNA from antisense mRNA interference (Fig 6E). In support of this idea, insertion of a TT between mgrA and the plasmid-encoded cadmium-inducible promoter produced the same effect as the presence of the complete IS (Fig 6E). A similar effect was performed using two-way ANOVA followed by Dunnett's post-test. Multiplicity adjusted p-values are shown for uninduced to induced comparison and all terminator containing reporters showed significant differences when compared to their respective empty plasmid control. ���� p<0.0001, ns not significant. https://doi.org/10.1371/journal.ppat.1009606.g005

Fig 6. Presence of IS influences the expression of downstream genes and genes product through multiple mechanisms. (A-D)
Reporter constructs for presence of the IS upstream and in sense orientation of the target gene were designed in plasmid pCN41 containing a β-lactamase reporter gene. The green triangle indicates the IS position relative to the target gene's start. All reporters were constructed either by removing the IS gene from S. aureus subsp. anaerobius strain MVF84 or by introducing the IS into S. aureus subsp. aureus strain RF122 to confirm the IS role on gene expression variation. In (A) the IS removal results in a 207bp deletion to the promoter region not present in RF122. Reporter plasmids were introduced into RN4220 as described in Methods. Data show the mean of three biological replicates, error bars represent the mean's standard deviation. One-Way ANOVA was performed followed by Tukey's multiple comparisons test. � p<0.05, �� p<0.01, ��� p<0.001, ���� p<0.0001, ns not significant. (E&F) Western blot analysis of (E) mgrA and (F) sle1 expression constructs to address whether the IS impacts on their expression levels by interrupting expression of an antisense transcript. (E) mgrA encoding a 3xFLAG tag was cloned in antisense to a cadmium-inducible promoter into plasmid pCN51 to mimic the expression of an antisense transcript to mgrA. Constructs presented different combinations of presence/absence of IS and/or transcriptional terminator the IS transcript or expression from the plasmid-encoded cadmium-inducible promoter from the mgrA transcript. Expression of the cadmium-inducible promoter was induced with 5 μM CdCl 2 . (F) To assess the IS impact in the context of a natural antisense transcript, 3xFLAG-encoding expression constructs of sle1 including the gene downstream of sle1 in the ancestral genome were cloned into the plasmid pCN47. Samples were taken during different growth phases. Plasmids were introduced into RN4220 Δspa as described in Methods. Western blots shown are representative of at least two independent biological replicates. TT, transcriptional terminator.
https://doi.org/10.1371/journal.ppat.1009606.g006 identified on sle1 encoding an N-acetylmuramyl-L-alanine amidase involved in staphylococcal cell separation and β-lactam resistance [30,31], whereby IS insertion has disrupted the antisense regulation of Sle1 expression by a neighbouring gene in a growth phase-dependent manner (Fig 6F). Thus, IS in S. aureus subsp. anaerobius can uncouple genes from ancestral antisense regulation by acting as anti-antisense elements, a hitherto unknown regulatory role.
Overall, our data suggest that the non-random fixation of IS into S. aureus subsp. anaerobius intergenic regions has influenced the expression of neighbouring genes through multiple distinct mechanisms. The putative attenuated expression of up to 87 different chromosomal genes involved in an array of different functions (S5 Table) has led to the re-wiring of the transcriptome.

Discussion
While S. aureus is an opportunistic pathogen responsible for an array of different pathologies in different anatomical sites in humans and animals, S. aureus subsp. anaerobius is restricted to a specific infection of superficial lymph nodes in sheep and goats. Here, we provide a remarkable example of the evolutionary transition of a versatile multi-host bacterium to a fastidious highly niche-restricted endemic pathogen of small ruminants. The transition was marked by multiple distinct evolutionary processes mediating drastic changes to the genome that resulted in a complete re-modelling of bacterium-niche interactions. Previous studies have indicated a human ancestral host for S. aureus and endemic livestock clones are the result of host jump events that have occurred during the evolutionary history of S. aureus [8,32]. Accordingly, the co-segregation of all isolates of the S. aureus subsp. anaerobius within a single monophyletic clade in the S. aureus phylogeny suggests a likely human to ruminant hostswitch event that preceded the evolutionary transition to a highly niche-specific ecology.
Such switches to a host-restricted lifestyle [33], have occurred across the bacterial kingdom in diverse lineages such as Yersinia [34], Mycobacterium [35], Shigella [36], Salmonella [37], and Burkholderia [38], and extreme examples are represented by endosymbionts that have evolved from free-living organisms to become dependent on a single host species [2,6]. While most known examples of host-restrictive evolution have time-scales of hundreds of thousands up to many millions of years [6,33], our analysis indicates that S. aureus subsp. anaerobius evolved about 1000 or more years ago offering a unique insight into the relatively early stages of evolution towards niche-restriction.
Host shifts are associated with a radical change in habitat, typically with a genetic bottleneck that diminishes effective population size and a corresponding reduction in purifying selection activity [2,33]. We discovered that approximately 10% of the S. aureus subsp. anaerobius genome is made up of pseudogenes affecting an array of metabolic and pathogenic pathways associated with the fastidious nutritional requirements and limited virulence of S. aureus subsp. anaerobius (Fig 1C). In particular, the microaerophilic metabolism, is likely to be due to the loss of function of catalase and other oxidoreductase genes leading to increased sensitivity to oxidative free radicals. These data suggest adaptation to a nutrient-rich, oxygen-limited niche, such as that provided by the lymphatic system [11]. The repair of mildly deleterious mutations such as those resulting in loss of gene function is likely compromised by lower levels of purifying selection leading to fixation by genetic drift, and strikingly, there is no evidence for deletion of genes that are not functional in the new habitat, perhaps because sufficient time has not elapsed to facilitate this process, possibly also impacted by the observed lower rate of mutation.
We identified a large number of closely-related IS elements distributed around the genome of S. aureus subsp. anaerobius (Fig 1A; S3 Table), a phenomenon that has been observed among other bacteria evolving towards host-restriction [2,4,6,38]. Ineffective purifying selection after a population bottleneck also allows insertion sequences (IS), normally present in bacterial genomes in low numbers (<10), to expand and disseminate around the genome in the early stages of host-restriction. Over time the IS elements may be eventually purged by deletion of regions of the genome via IS-mediated recombination, such that recently host-restricted bacteria will generally contain more IS than ancient symbionts [2,4]. It is well established that IS elements often insert into coding regions resulting in gene inactivation and can also mediate chromosomal rearrangements of genetic segments flanked by ISs via homologous recombination events [15]. We identified five large rearrangements that occurred in the common ancestor of all isolates and an additional event exclusive to the clade comprising European isolates. Although the effects on bacterial phenotype resulting from the identified rearrangements is unknown, large rearrangements can change the distance of a gene from the origin of chromosome replication leading to altered gene copy number and expression [39,40] and similar rearrangements in S. aureus have been reported to mediate transition to reduced virulence phenotypes such as single colony variants associated with persistent infections [16,17].
While IS elements often inactivate genes, it is striking that only one of 87 IS insertions in S. aureus subsp. anaerobius results in gene disruption. It is established that IS elements can also activate expression of neighbouring genes, either by an extended transcription from an internal promoter or by the generation of a hybrid promoter [23,41]. Uniquely, the ISmediated gene regulation of S. aureus subsp. anaerobius involves one established and one novel mechanism of control. Upstream IS elements oriented in the sense orientation relative to the flanking gene (~25% of IS) act mainly through modulation of promoter and operon structure as previously reported [42]. However, in S. aureus subsp. anaerobius, the majority of ISs (~75%) are located downstream in the antisense orientation. In the ancestral S. aureus subsp. aureus strain, the expression of some genes is controlled by transcripts of a downstream gene in the antisense orientation suggesting expression of the corresponding proteins is mutually exclusive [43]. Here, we have shown that in S. aureus subsp. anaerobius, IS inserted in an antisense orientation to flanking genes can effectively uncouple the targeted genes from their interdependent expression control and act as anti-antisense regulatory elements. This novel regulatory mechanism might provide a selective advantage to S. aureus subsp. anaerobius in facilitating the simultaneous expression of proteins required concurrently in the new niche.
While some bacterial pathogens such as Bordetella pertussis utilise IS elements to control the expression of their flanking genes in a strain-specific manner [42], the genomic localisation of the majority of IS elements in S. aureus subsp. anaerobius is conserved across the phylogeny, indicating that insertion events happened early in the evolution of S. aureus subsp. anaerobius, and suggesting that the complement of genes affected may be important for the ecology of S. aureus subsp. anaerobius.
In conclusion, using a combination of phylodynamic, comparative genomic and molecular biology approaches, we have dissected the relatively recent evolution of a host-restricted bacterial pathogen underpinned by drastic changes to the genome via numerous distinct processes. In particular, IS elements have been domesticated by S. aureus, contributing to the chromosomal architecture via intra-chromosomal rearrangements, and control of gene expression via multiple mechanisms. Taken together, our findings provide a unique and remarkable example of the capacity of bacterial pathogens to expand into new host niches.

Bacterial strains and growth conditions
The bacterial strains employed are detailed in S6 Table. S. aureus was grown in Tryptic soy broth (TSB) or on Tryptic soy agar and Escherichia coli was grown in Luria-Bertani broth (LB) or on LB agar. Antibiotic selection was used where appropriate (erythromycin 10 μg ml -1 for S. aureus and ampicillin 100 μg ml -1 for E. coli.

Whole genome sequencing and genomic analysis
Forty S. aureus subsp. anaerobius ovine isolates previously reported [10] were selected, of which 31 were sampled in Spain during different outbreaks spanning 3 decades (1981-2012). The rest of the isolates, sampled between 1996 and 2011, were from Sudan (n = 3), Italy (n = 3), Poland (n = 2) and Denmark (n = 1). Isolates were sequenced using a MiSeq machine (llumina, San Diego, CA, USA); the paired-end short reads were trimmed using Trimmomatic v0.36 [44] and de novo assembled into contigs using SPAdes v3.10.0 [45] and Velvet v1.2.10 [46]. Illumina read data is available in the European Nucleotide Archive under the study accession number PRJEB30965. We provide read quality statistics for the Illumina samples in S9 Table. One of the Spanish isolates (MVF7, the type strain) was sequenced on the RSII platform (Pacific Biosciences, Menlo Park, CA, USA) using SMRT technology in the Centre for Genomic Research (University of Liverpool, UK). The long reads were de novo assembled into a single contig using the Hierarchical Genome Assembly Process (HGAP) method. The MVF7 full genome assembly was deposited at NCBI with GenBank accession number GCA_014876765.1. The average read coverage depth was 479X (range: 120-790X) and the average read length was 18,464 bp.

Pseudogene detection
Here, pseudogenes were defined as protein-coding sequences with homologous genes in an S. aureus subsp. aureus reference that were split or truncated (i.e. <80% of the reference gene length). For this purpose, the isolate RF122 (assembly GCA_000009005.1) was employed as a reference and a custom python script (https://github.com/GonzaloYebra/anaerobius) was developed to identify pseudogenes from the output of a Roary analysis of S. aureus subsp. anaerobius isolates compared to the S. aureus RF122 genome. The analyses were repeated using multiple S. aureus subsp. aureus genome sequence references, but this did not alter significantly the pseudogenes detected. Ancestral functions of pseudogenes were predicted by assigning them to clusters of orthologous groups (COGs) using eggNOG v4.5.1 [52]. Enrichment analyses of GO (Gene Ontology) terms and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways assigned by eggNOG were performed using the R packages topGO v2.34.0 and ClusterProfiler v3.10.0 [53]. Additionally, the Integrated Microbial Genomes (IGM) annotation tool [54] was used to infer phenotypic metabolic characteristics from the presence/absence of protein pathways.

Insertion sequence detection
Transposase coding sequences were identified in the MVF7 complete genome using the Prokka output and the ISsaga web tool hosted in the ISfinder platform [55]. Determining the location of IS elements and other repetitive sequences by assembly of short-read sequences is often not feasible [56]. Therefore, in order to examine the distribution of IS identified in strain MVF7 among other isolates, Illumina reads were mapped to fragments of the MVF7 genome representing each transposase edge together with their flanking genes using BWA-MEM [57]. In this manner, identification of reads that spanned the border between the IS and the flanking gene indicated the presence of that IS at the same genomic location relative to MVF7.

Phylogenetic analyses
A core genome alignment was created by aligning the Illumina short reads and the ST1464 assembly to the MVF7 whole genome using Snippy v3.1 (https://github.com/tseemann/ snippy). Sites containing any gap character ('-') or unknown nucleotide ('N') were discarded. Gubbins v2.2.0 [58] was used to detect recombinant regions which were then discarded. A maximum likelihood (ML) tree was constructed using IQ-TREE [59], applying the GTR nucleotide substitution model together with a gamma-distributed rate heterogeneity across sites and 1,000 ultrafast bootstrap replicates. Temporal signal was investigated using TempEst v1.5 [60] by means of the correlation between root-to-tip distances and sampling dates (S2 Fig).
Bayesian phylogenetic analysis was performed with BEAST v1.9.0 [61] using the HKY model for nucleotide substitution. Different models were tested for the molecular clock (strict and uncorrelated lognormal relaxed) and demographic (constant, exponential and Bayesian skygrid) models. Each of these model combinations were run for 100 million generations, with sampling every 10,000 and discarding the initial 10% as burn-in. Runs were compared via a marginal likelihood estimation (MLE) using path sampling and stepping stone sampling methods implemented in BEAST. The posterior distribution of trees was summarised into a maximum clade credibility tree.
In order to examine the relatedness of S. aureus subsp. anaerobius and S. aureus, another core genome SNP tree was created, also using Snippy v3.1 and IQ-TREE (same settings than above). It included a sample of 790 S. aureus genomes (corresponding to 43 different host species and 77 clonal complexes (CCs), isolated in 50 different countries) along with 17 isolates of the most closely-related staphylococcal species (S. schweitzeri and S. argenteus) [8]. All bioinformatic analyses were carried out using the Cloud Infrastructure for Microbial Bioinformatics (CLIMB) facility [62].

Gene cloning
General DNA manipulations were performed using standard procedures. PCR fragments were amplified from genomic DNA using Kapa Hifi DNA polymerase (Kapa Biosystems). PCR fragments were either digested using restriction endonucleases and ligated into the respective plasmid backbone or assembled directly into a linearised plasmid backbone using Gibson assembly (NEBuilder HiFi DNA Assembly, NEB) according to the manufacturer's instructions. The plasmids and oligonucleotides used in this study are listed in S7 and S8 Tables, respectively.

Western blot analysis
S. aureus cultures harbouring the respective pCN47 or pCN51 derivative plasmids were diluted 1:50 from overnight cultures and grown in TSB at 37˚C and 120 rpm until sample collection. For pCN51 constructs induced with CdCl 2 , cultures were split at early exponential phase (OD 540~0 .15), 5 μM CdCl 2 was added to half the cultures and incubation continued for another 2 h. S. aureus strains carrying pCN47 reporter constructs were grown to early exponential (OD 540~0 .15), mid-exponential (OD 540~0 .8) or stationary phase (OD 540~2 after overnight culture). A sample amount corresponding to and OD 540 of 0.5 in 1 ml was pelleted and stored at -20˚C. The sample pellets were re-suspended in 100 μl digestion/lysis buffer (50 mM Tris-HCl, 20 mM MgCl 2 , 30% (w/v) raffinose) plus 1 μl of lysostaphin (12.5 μg ml -1 ) and incubated at 37˚C for 1 h. Samples for SDS-PAGE were prepared by adding 4X NuPAGE LDS Sample Buffer (Invitrogen) and proteins denatured at 95˚C for 10 min. Samples were separated by SDS-PAGE (NuPAGE 4-12% Bis-Tris Protein Gels, Invitrogen and then transferred to a PVDF transfer membrane (Thermo Scientific, 0.2 μM) following standard procedures [63,64]. FLAG-tagged proteins were detected using mouse anti-FLAG-HRP antibody (Monoclonal ANTI-FLAG M2-Peroxidase (HRP), Sigma-Aldrich) following the manufacturer's protocol.

Enzyme assay for the quantification of β-lactamase activity in transcriptional fusion plasmids
Samples for pCN41 based reporters were grown to either early exponential phase (OD 540~0 .15) or stationary phase for promoters with low activity (metC) (OD 540~2 after overnight culture) and 1 ml of culture snap frozen. β-Lactamase assays, using nitrocefin (BD Diagnostic systems) as substrate, were performed as previously described [65] using an ELx808 microplate reader (BioTek) measuring absorbance at 490 nm. Promoter activity was calculated using the following equation: