Global emergence and population dynamics of divergent serotype 3 CC180 pneumococci

Streptococcus pneumoniae serotype 3 remains a significant cause of morbidity and mortality worldwide, despite inclusion in the 13-valent pneumococcal conjugate vaccine (PCV13). Serotype 3 increased in carriage since the implementation of PCV13 in the USA, while invasive disease rates remain unchanged. We investigated the persistence of serotype 3 in carriage and disease, through genomic analyses of a global sample of 301 serotype 3 isolates of the Netherlands3–31 (PMEN31) clone CC180, combined with associated patient data and PCV utilization among countries of isolate collection. We assessed phenotypic variation between dominant clades in capsule charge (zeta potential), capsular polysaccharide shedding, and susceptibility to opsonophagocytic killing, which have previously been associated with carriage duration, invasiveness, and vaccine escape. We identified a recent shift in the CC180 population attributed to a lineage termed Clade II, which was estimated by Bayesian coalescent analysis to have first appeared in 1968 [95% HPD: 1939–1989] and increased in prevalence and effective population size thereafter. Clade II isolates are divergent from the pre-PCV13 serotype 3 population in non-capsular antigenic composition, competence, and antibiotic susceptibility, the last of which resulting from the acquisition of a Tn916-like conjugative transposon. Differences in recombination rates among clades correlated with variations in the ATP-binding subunit of Clp protease, as well as amino acid substitutions in the comCDE operon. Opsonophagocytic killing assays elucidated the low observed efficacy of PCV13 against serotype 3. Variation in PCV13 use among sampled countries was not independently correlated with the CC180 population shift; therefore, genotypic and phenotypic differences in protein antigens and, in particular, antibiotic resistance may have contributed to the increase of Clade II. Our analysis emphasizes the need for routine, representative sampling of isolates from disperse geographic regions, including historically under-sampled areas. We also highlight the value of genomics in resolving antigenic and epidemiological variations within a serotype, which may have implications for future vaccine development.


Introduction
Pneumococcal disease caused by Streptococcus pneumoniae remains a significant cause of morbidity and mortality even in the era of effective conjugate vaccines, which protect against up to 13 of the nearly 100 known serotypes [1]. Serotype is based on biochemical and antigenic properties of the polysaccharide capsule, a key virulence factor encoded by the capsular (cps) locus and the target of the pneumococcal conjugate vaccines (PCV). Among those serotypes covered by the 13-valent pneumococcal conjugate vaccine (PCV13), serotype 3 is considered highly invasive and is associated with a high risk of mortality in both observational studies and animal models [2][3][4][5]. However, the serotype-specific effectiveness of PCV13 against serotype 3 remains uncertain. There has been little reduction in serotype 3 disease compared to disease due to other vaccine serotypes following implementation of PCV13 in the USA (Fig 1) [6]. In addition, data from a randomized controlled trials on nasopharyngeal carriage [7], post-licensure studies of PCV13 on invasive pneumococcal disease (IPD) [8], and national surveillance data on IPD from England and Wales [9,10], among other studies have shown little vaccine effect on serotype 3.
Pneumococci can be divided into lineages, often referred to as sequence clusters, based on statistical analysis of nucleotide diversity in the core genome [11][12][13]. In general, lineages are often comprised of strains possessing the same multi-locus sequence typing (MLST) and serotype. However, whereas MLST profiles are unlikely to appear in multiple lineages, the same serotype may be distributed across the population due to the transfer of the genes encoding the capsule between lineages by homologous recombination [14]. MLSTs can be further grouped into clonal complexes (CCs) [15]. For serotype 3, MLST data have shown that the majority of isolates globally are a single CC of closely related genotypes (CC180) also known as the Netherlands 3 -31 (PMEN31) clone [16][17][18][19], with the exception of Africa where non-CC180 MLST types are prevalent [20].
However, more recent genomic studies have shown that CC180 contains multiple distinct lineages [18,21], which MLST is not sufficiently discriminating to detect. While one of these lineages (termed 'Clade I') accounts for almost all previously studied genomes, the overwhelming majority (92%) of these isolates came from European collections, and might not be representative of the global pneumococcal population.  [6]. Rates of IPD expressed as cases per 100,000 population are on the y-axis, and calendar year of surveillance are on the x-axis. The green line represents the IPD rate for the five most common non-PCV13 serotypes; purple line, the rate for serotype 3 only; orange line, the rate of (PCV13 -serotype 3) serotypes containing (1, 4, 5, 6A, 6B, 7F, 9V, 14, 18C, 19A, 19F, and 23F). Further information can be obtained from the CDC's ABC surveillance (www.cdc.gov/abcs/). Following roll-out of PCV13, an unexpected increase in the carriage prevalence of serotype 3 was noted among Massachusetts children less than seven years of age [22]. Initial genomic analysis of these carriage isolates showed a change in the serotype 3 population [23]. Prior to vaccination the majority of isolates fell into the previously described 'Clade I', while following PCV13 introduction, the observed increase in carriage prevalence was largely due to isolates which, while still CC180, were drawn from a more diverse population that has been poorly represented in prior samples. Phenotypic variations within CC180 and how they may relate to vaccine efficacy have not been investigated and remain poorly understood.
To investigate the population structure and evolutionary history of serotype 3 CC180 pneumococci, we conducted a genomic analysis of 301 serotype 3 CC180 isolates from carriage and disease collected from 24 countries over 20 years. Further, we assessed multiple phenotypic features linked to the epidemiology of serotype 3, and which might contribute to vaccine escape including reaction to PCV13 antisera, and how they vary in the CC180 population [24][25][26].

Study population and epidemiological data
The sample consisted of 301 CC180 (PMEN31) genomes collected from carriage (n = 68), invasive disease (n = 231), and unknown clinical manifestations (n = 2) between 1993-2014. Whole-genome sequencing (WGS) data for this sample were obtained from a previous analysis of serotype 3 described elsewhere (n = 82) [18], on-going studies of carriage in Massachusetts, USA (n = 27) [12], carriage studies in the Southwest USA (n = 14) [27], the Bacterial Isolate Genome Sequence database (BIGSDB) (n = 10), and the Global Pneumococcal Sequencing (GPS) project (http://www.pneumogen.net/gps/) (n = 168). The Harvard IRB approved Massachusetts and Southwest USA carriage studies described elsewhere [12,18,28,29]. The contributions of isolates by GPS sites were approved by the respective institution's IRB. These data as well as publicly available and previously published data were anonymized for our analysis. Available data included the year and country of isolation, isolation source, and limited patient demographic data. To assess the temporal shift in prevalence of the three CC180 clades, we tested the significance of changes in the proportion of isolates by clade for each sampling year by comparing yearly proportions to 1000 random deviates of a Dirichlet distribution [30]. If the temporal change in proportion was found to be significant, then the directionality of the change was assessed. Current and past PCV (PCV7, PCV10, and PCV13) use data for countries where the serotype 3 CC180 sample was collected were queried from International Vaccine Access Center (IVAC) VIEW-hub website (www.view-hub.org, accessed April 5, 2017). Fisher's exact test was used to test the association between countries that introduced PCV and serotype 3 clade emergence based on phylogenetic demography (See phylogenetic analysis methods). Further, among countries that introduced PCV13, the correlation between the date of introduction and changes in serotype 3 demography was assessed using Pearson's correlation coefficient. All statistical analysis and figure generation was performed using Rstudio v1.0.143 with R v3.3.1.

Genomic analysis
For GPS isolates, raw sequencing data and de novo assemblies were downloaded from http:// www.pneumogen.net/gps/. GPS de novo assemblies were generated using a Velvet pipeline as previously described [31]. The remaining sequencing data were downloaded from the NCBI SRA database (See S1 File for a list of accession numbers). If only draft assemblies were available, paired-end 150 bp reads were generated using the BBMap's RandomReads script. De novo assemblies for non-GPS isolates were generated using SPAdes v3.10 as previously described [18,32]. Assemblies were then annotated using Prokka v1. 12 [33] and pangenome analysis was conducted using Roary [34]. Assemblies constructed using simulated reads were checked for variation in assembly metrics, annotation, and SNP diversity, in order to detect any evidence of sequencing and assembly errors. Variants for 13 polymorphic pneumococcal protein antigens-including pneumococcal surface proteins C and A (pspC and pspA)-were determined by mapping raw reads to an antigen variant database using SRST2 as previously described [27]. Quality filtered and trimmed reads were mapped to S. pneumoniae OXC141 (NCBI Reference Sequence: NC_017592), a Clade I, serotype 3 ST180/CC180 carriage isolate, using SMALT v0.7.6. SNPs were called using SAMtools v1.3.1 [35], and were filtered requiring QUAL>50, depth of coverage �5 and a minimum alternate allele frequency �0.75 [12,36]. Gubbins v2.1 was used to assess recombination, and coding sequences (CDSs) impacted by recombination blocks were annotated and plotted using Circos [37,38]. Last, we assessed nonsynonymous mutations and recombination in the comCDE operon, which has previously been implicated in the low competence of CC180 compared to other pneumococcal lineages [18].

Phylogenetic analysis
A maximum likelihood (ML) phylogeny was inferred from a recombination-censored SNP alignment using RAxML v8.2.1 with an ASC_GTRGAMMA nucleotide substitution model, Lewis ascertainment bias correction, and 100 bootstrap replicates [39]. The ML phylogeny was rooted using strain AP200 (accession # CP002121), a serotype 11A, ST62 S. pneumoniae invasive isolate, which was found to be immediately basal to the CC180 clade [40]. After assigning isolates to major clades, Gubbins was run independently on the sequence alignments from each clade to identify putative recombination events. ML phylogenies from recombinationcensored alignments were used to test temporal signal by assessing correlation between strain isolation date and root-to-tip distance. To reduce bias in coalescent analysis due to differences in sampling over the study period, isolates were subsampled using a uniform probability from each year and spatial location (country and region), as recommended in [41]. To verify temporal signal, we performed date randomization tests on the down-sampled datasets, whereby the sampling times of sequences are shuffled and evolutionary rates are compared to correctly assigned time estimates (see S1 Methods) [42,43]. For major Clade I (hereon referred to as Clade I-α) and Clade II, coalescent analysis was performed using BEAST v1.8.4 (see S1 Methods) [44]. Parameter estimates for the evolutionary rate, root height, and N e were obtained from the best-fit model and compared between clades. Last, to infer the ancestral geographic location and migration history of Clade II, we used the structured coalescent implemented in Beast 2.4.4 specifying the region of collection as the geographic location for each tip [45,46]. Additionally, BEAST XML code has been made available (https://github.com/c2-d2/Projects).

Phenotypic and genotypic antibiotic resistance
Antimicrobial resistance (AMR)-associated genes and SNPs were identified with ARIBA using ARG-ANNOT and CARD databases [47,48]. Penicillin MICs were predicted using WGS data to type transpeptidase domains of penicillin binding proteins [49]. Genotypic antibiotic resistance was validated among strains with available broth dilution data using published CLSI breakpoints for penicillin, chloramphenicol, erythromycin, clindamycin, and tetracycline. For isolates that possessed multiple AMR-associated genes, annotated de novo assemblies were investigated to identify conjugative transposons. Transposons were identified through review of de novo assembly annotations and their presence/absence among strains was confirmed by mapping sequencing reads to transposons as described above.

Assessment of capsular polysaccharide variations
First, we assessed variations in the CPS loci by abstracting the region from reference-based genome assemblies and generating an ML phylogeny using RAxML with GTRGAMMA substitution model and 100 bootstrap replicates. The mean pairwise nucleotide difference was also calculated enforcing pairwise deletion of missing sites. CPS loci were manually inspected for recombination events using the output from Gubbins to determine the potential role in clade variation.
To investigate phenotypic variations related to the serotype 3 CC180 capsular polysaccharide between Clades I-α and II and potentially explain the recent emergence of Clade II, we assessed surface charge (zeta potential), capsular release, and opsonophagocytic killing. Zeta potentials among representative Clade I-α (n = 5) and Clade II (n = 3) serotype 3 strains from Massachusetts were compared to S. pneumoniae serotype 3 ST378 laboratory strain WU2 and ΔCPS WU2 as previously described [25]. Capsular release, a mechanism by which type 3 CPS interferes with antibody-mediated killing and gains protection by anti-CPS antibodies, was compared among Clade I-α (n = 3) and Clade II (n = 3) serotype 3 isolates from Massachusetts to strain WU2 as previously described [50]. To assess the efficiency of antisera against serotype 3 CC180 to opsonize pneumococci for uptake and killing by differentiated polymorphonuclear leukocytes, we used an opsonophagocytic killing assay (OPKA) [51]. The killing assays were performed at multiple dilutions using PCV13 antisera and antisera generated from type 3 polysaccharide (PS) (see S1 Methods) [52]. All Clade I-α and II strains studied in phenotypic assays are marked on the ML phylogeny.

Population structure
The serotype 3 CC180 isolates (n = 301) included in this sample came from 24 countries from North America (38.9%), Western Europe (17.9%), Asia (14.6%), Eastern Europe (14.3%), Africa (7.6%), and South America (6.3%) collected from 1993-2014. The ML phylogeny, inferred from a core genome alignment of 15,327 SNP sites, improved the resolution of the three major lineages identified in previous work. Clade I and the previously described Clade II together form a single monophyletic lineage, distinct from the previously described Clade III (Fig 2A). Applying an out-group roots the phylogeny on the branch between those strains previously described as Clades II and III, indicating that the previous clade naming scheme was inaccurate. To reflect this, and the fact that we now find two rather than three monophyletic lineages, we refer to Clades I-α and I-β (formerly Clades I and II) and Clade II (formerly Clade III). Clade I-α isolates made up the majority of the sample (68.4%), but a significant proportion is made up of Clade I-β (12.3%) and Clade II (19.3%). The expanded sample now shows a single deep branching lineage containing Clade I-β, which is polyphyletic and subtends Clade I-α. Clade II is further subdivided into three well-supported subclades that are distinct in terms of genome content (see below). Following removal of regions that were inferred to have been introduced by recombination, the nucleotide diversity of Clade II was significantly greater than that of Clade I-α [mean pairwise SNP distance 98.0 (SE 2.1) vs. 112.7 (SE 3.6)].

Phylogeography and country-level vaccine use history
The proportion of isolates belonging to Clade II has significantly increased over time from 1999-2001 (11%) to 2011-2014 (41%) (4.2, 95% CI: 1.9-9.0, p<0.0001) with the largest increase occurring in North America (Fig 2B). During the same time Clade I-α has Phylogeny was out-group rooted using strain AP200 (accession # CP002121) a serotype 11A, ST62 S. pneumoniae invasive isolate, which was found to be immediately basal to the CC180 clade in a global phylogeny of pneumococcal reference genomes and therefore represented the closest out-group. Bootstrap values from 100 replicates are labelled on major clades and epidemiologically relevant sub-clades. Additional values can be found in the supplemental phylogeny. Mean pairwise within-and between-clade SNP difference are presented in the grey shaded box with colors corresponding to the phylogeny. B.) World map illustrating sampled countries and regions with respective proportion of isolates belonging to Clade I-α, I-β, and II. Countries are colored according to region and pie charts represent the proportion of isolates belonging to major serotype 3 clades. The size of the pie chart is scaled to the proportion of strains sampled from each region. Mapping was performed using R package maps v3.3.0. C) Proportion of clade membership by three-year collection window. The proportion of clade membership by region over-time is displayed on the top of the figure. Pie charts are scaled by the number of isolates sampled from a geographic region by time window (i.e., column). The overall proportion of clades for each time window is presented on the bottom of the figure. significantly decreased, and Clade I-β remained largely unchanged. In the present sample, Clade II was first reported in Asia in 1999, but is now globally distributed, making up a large proportion of samples from Asia, Africa, and North and South America ( Fig 2B). However, Clade II was only observed twice among 97 European strains (Table 1), first appearing in 2003. Clade I-α was the most prevalent in the sample and found to be globally distributed, while Clade I-β isolates appear more common in samples from South America and Asia ( Fig 2B).
Clades I-α and II had significant temporal signal, identified by root-to-tip date correlation (S1 and S2 Figs). A date randomization test also showed significant temporal signal for Clade II, but date randomizations for Clade I-α failed to reach ESS values >200 despite chain length (S3 Fig). Model comparison using Bayes factors calculated from MLEs identified both Clades I-α and II fit a GMRF SkyGrid demographic model and relaxed molecular clock (S1 Table). In addition, the exponential demographic model was preferred to the constant for Clade II but rejected Clade I-α. Evolutionary rates were not significantly different between the two clades; however, the 95% Highest Posterior Density (HPD) for Clade II was wider (S4 Fig), owing to a greater coefficient of variation (S = 0.38 vs 0.21 for Clades II and I-α, respectively) and indicating higher rate variation among branches. Clade II was significantly younger, with an estimated most recent common ancestor (TMRCA) of 1968 [95% HPD: 1939-1989] (S1 Table). The effective population size (N e ) for Clades I-α and II have been increasing, as demonstrated by the SkyGrid N e plots (Fig 3) and rejection of the constant population size demographic model (S1 Table). Further, the exponential model for Clade II suggested the population was exponentially increasing (mean exponential growth rate = 0.054 [95% HPD: 5.11x10 -3 , 0.11]).
Phylogeographic migration models of Clade II using the structured coalescent achieved sufficient mixing (i.e., ESS values >200) for all parameters; however, posterior probabilities for migration rates between geographic regions were not significant (<0.30). Therefore, we were unable to infer the ancestral geographic locations of Clade II isolates. We noted, however, that in both ML and Bayesian time-scaled phylogenies, the Asian clade made up of isolates from Hong Kong is proximally basal to the major North American clade (Figs 4 and 5). Further, isolates from Hong Kong possess a unique Tn916-like transposon shared only with isolates found in the dominant North American clade ( Fig 5) (see below).
Clade II was found in 11 of 24 countries, 10 of which introduced PCV during our study period (Fig 6). There was no association between countries that introduced PCV13 and the observation of an isolate belonging to Clade II (Fisher's Exact

Recombination and genomic variation
We sought to identify whether the clades we identified above showed differences in the historical contribution of recombination, which has been previously considered important in adaptation to clinical interventions [53]. We found that Clades I-α, I-β, and II experienced substantial recombination, impacting 1,179 CDS and diversifying gene content among clades (S5 and S6 Figs). The ratio of polymorphisms introduced through recombination compared to those introduced by mutation (r/m) for the entire sample was estimated at 1.76 with a mean recombination tract length of 12.3 kb. When independently assessed, we found Clade I-α possessed the lowest r/m among the three tested clades as well as the smallest tract length for recombination events (Table 2). Forty-two unique recombination events affecting 582 CDS occurred on the branch segregating Clade II and I-β resulting in an r/m of 4.0 (S5 Fig). Comparison of r/m values between internal and terminal branches of the phylogeny suggests that the great majority of recombination events in CC180 are located on internal branches, although more recent events have occurred especially in Clade I-β (Table 2). We subsequently examined the comCDE operon, encoding the competence stimulating peptide and two-component regulatory system, and the associated regulatory genes comAB to identify any variations among these loci between the clades. Overall, the comCDE operon was markedly  uniform (mean pairwise SNP distance = 1 SE 0.5); however, we found a non-synonymous mutation in comD (AA104: Pro->Ser) among Clade I-β and II strains, which segregated them from Clade I-α. Further, while competence factor transporting protein comA was identical among the three clades, competence factor transport accessory protein comB was more diverged between Clades I-β/II and Clade I-α (mean pairwise SNP distance = 1.7 SE 0.7), as a result of a recombination event that affected all strains belonging to Clade II. Recombination impacted a substantial proportion of the genome, focused in known recombination "hotspots" (Fig 7). As a result, we observed significant variation in mobile genetic elements (MGE) and gene content, including polymorphic protein antigens, evident through comparison of de novo assemblies and patterns of recombination. Pangenome comparison identified 1,437 core genes, as well as variations in gene content among clades (S7 Fig). Two notable differences in MGE that resulted in gene content variation among CC180 clades were the presence of a Tn916-like conjugative transposon in Clade II strains (discussed below) and  the absence of the 33.3 kb prophage ϕOXC141 in Clades I-β and II [54,55]. The ϕOXC141 prophage, which was putatively acquired by an ancestor of Clade I-α, has been lost multiple times by members of Clade I-α (Fig 5). Notably, it is absent from some Clade I-α isolates from North America, which includes a number of strains from Massachusetts that form a distinct subclade. In all, ϕOXC141 is present in 71% of Clade I-α isolates and 50% of CC180 serotype 3 strains overall.
Among 13 tested polymorphic protein antigens, five were found to be variable, including membrane associated protein SP2194, cell-wall anchored proteins neuraminidase A (NanA) and β-N-acetylhexosaminidase (StrH), and surface exposed proteins pspC and pspA (Fig 5). SP2194 encodes the ATP-binding subunit of Clp protease and is involved in the expression of the comCDE operon, which plays an important role in competence, survival, and virulence of pneumococci [56]. Three internal and three terminal branches contain recombination events that have impacted SP2194. Major events occurred on each internal branch leading to Clades  Population genomics of serotype 3 CC180 pneumococci I-α, I-β, and II, and as a result, SP2194 is significantly diverged among the three clades (mean between clade SNP difference = 31.1 SE 3.8; p-distance 0.013 SE 0.002). Protein antigen genes pspA and pspC are known recombination "hotspots" [57], and here we find that recombination generated variation that subsequently become fixed in dominant clades and sub-clades. The Family 2 pspA variant was largely associated with Clades I-α and I-β, while the Family I variant Population genomics of serotype 3 CC180 pneumococci was associated with Clade II [58]. Using the pspC groupings-based on structural and nucleotide variation proposed by Iannelli et al.-we found that Clades I and II had non-overlapping sets of pspC variation: indeed, Clade I possessed a pspC variant consistent with Groups 7/8/9, whereas Clade II variants included Groups 1/5 and 2/3/6 [59]. Hence, we have good evidence that the clades present divergent protein antigenic profiles to the human immune system.

Antimicrobial-resistance (AMR) associated genes
We found variation among clades in genes associated with AMR. Clade II isolates from Hong Kong (n = 6) and USA (n = 9) harbored a 37.6 kb Tn916-like conjugative transposon possessing tetM and ermB (S8 Fig). These strains also possess chloramphenicol acetyltransferase (cat). For 13 of 15 strains with Tn916, antibiotic susceptibility testing was available in GPS metadata or from previous publications [21]. In general, Tn916 positive isolates exhibited high-level resistance to tetracycline, clindamycin, and erythromycin. However, three USA strains were susceptible to macrolides as the result of a previously described ermBS marker, a missense substitution within ermB that is associated with erythromycin and clindamycin-susceptibility [60]. Additional phylogenetic analysis of 26 core transposon clusters of orthologous genes (COGs) as well as genealogies of Int-Tn, ermB, and tetM illustrated that there were in fact two acquisitions of genetically similar Tn916 transposons among Clade II isolates (S8 Fig). In both of these instances, the transposon was first observed in a Clade II isolate collected from Hong Kong, which lie basal to a clade of USA isolates possessing the respective variant. Other notable differences between clades include the presence of tet32 in eight Clade II isolates, which is distinct from tetM carried on Tn916 transposons. In addition, the multi-drug efflux-pump pmrA was present in almost all Clade I-β and II isolates. We were able to confidently type PBP transpeptidase domains of 270 isolates to predict penicillin resistance. All were predicted to be penicillin-susceptible; however, four Clade II isolates possessed a PBP profile with first-step β-lactam resistance (transpeptidase domain profile 2-0-111, predicted MIC of 0.06) and were later confirmed by broth dilution to have penicillin MICs ranging from 0.05 (n = 1) to 0.06 (n = 3). These four isolates also possessed Tn916. PBP profiles further differentiated CC180 clades, with the majority of Clade I-α isolates possessing profile 2-3-2, and Clades I-β and II strains profile 2-0-2 (S2 Table).

Capsular polysaccharide variation in phenotype
Interrogation of inferred recombination events showed that the CPS loci were unaffected by homologous recombination (Fig 7). Furthermore, while the ML phylogeny estimated from the alignment of CPS loci was clearly segregated between the dominant clades, the average pairwise nucleotide difference was only 1.6 (SE 0.4) SNPs (S9 Fig). However the absence of major genotypic variation does not rule out relevant phenotypic variation, so we assessed phenotype of representative Clade I-α (n = 5) and Clade II (n = 3) serotype 3 strains for three properties previously associated with success: zeta potential, capsule shedding and resistance to neutrophil-mediated opsonophagocytosis. The zeta potential measures surface charge, which varies among capsular type and has previously been associated with differences in prevalence and duration of carriage, with more negatively charged capsules being carried for longer [25], but while the surface charge of clade II isolates was slightly more negative among tested strains this difference was not significant (S10A Fig). In addition, comparison of isolates from Clades I-α and II found no significant difference in capsule shedding (S10B Fig). We did observe differences in neutrophil-mediated opsonophagocytic killing. Clade II isolates were killed in the presence of antisera against both PCV13 and type 3 polysaccharide. In contrast, Clade I-α isolates were not susceptible to killing in the presence of type 3 polysaccharide antisera, and only one out of the three tested Clade I-α isolate was killed in the presence of high titer PCV13 antisera.

Discussion
Serotype 3 pneumococci belonging to CC180, the most dominant serotype 3 clone globally, remain epidemiologically important. Over the last decade, Clade I-α and Clade II have increased in effective population size, with Clade II becoming particularly prevalent in Asia and North America and emerging in other regions where it has historically not been observed. Phylogenetic analysis demonstrates that Clade II is significantly diverged in both core genome nucleotide diversity and genome content, with multiple recombination events having occurred at the base of the lineage. Temporospatial surveillance data, coalescent analysis, and molecular epidemiology evidence suggests that the diaspora of Clade II may have originated in Southeast Asia in the 1990s and subsequently spread worldwide over two decades, concomitantly growing in prevalence. In North America, Clade II now makes up more than half of the post-PCV13 serotype 3 CC180 population. While neither use of PCV at a country level, nor timing on PCV introduction appears to be clearly associated with the rise of Clade II, we find that Clade II has divergent surface protein antigens and increased prevalence of antibiotic resistance compared to the other CC180 clades. These two observations are likely contributors to the recent increase in prevalence of Clade II.
While phylogeographic analysis was unable to infer confidently the ancestral migration events that disseminated Clade II, the presence of Tn916 among strains from Hong Kong and the USA provides molecular evidence of migration from Asia to North America. In at least two instances, distinct Tn916 transposons are found among isolates from Hong Kong, which are basal to strains from North America. This suggests the acquisition of this transposon predates the emergence of Clade II in North America; thus, strains circulating in Asia were the ancestral populations of a large proportion (~50%) of North American Clade II isolates. Further, multiple introductions of serotype 3 Clade II isolates to the USA from Asia may have occurred. Tn916 confers macrolide resistance, with the rare exception of strains possessing a mutation in ermB [61]. Macrolide resistance has previously been described among serotype 3 ST180 isolates from Italy and Japan in 2001 and 2003, respectively, where macrolide-resistant serotype 3 strains are prevalent [19,62,63]. In addition, macrolide and tetracycline resistance ST180 strains have been reported from Taiwan (1997), Spain (2002), and more recently, Canada (2011-14), and Germany (2012) (https://pubmlst.org/). Unfortunately, MLST is not discriminating enough to distinguish between the CC180 lineages we discuss here, and as genomic data are not available for these isolates, we cannot determine how they relate to our sample. However, as macrolide resistance among CC180 serotype 3 isolates is relatively rare and largely isolated to Clade II, the early identification of macrolide resistance strains from Hong Kong, Taiwan, and Japan may represent the ancestral population of Clade II putatively possessing Tn916, further supporting a Southeast Asian origin and implicating antibiotic resistance as a possible contributor to the recent success of Clade II.
As PCV use has previously been associated with major changes in pneumococcal population structure including replacement of vaccine serotypes by non-vaccine serotypes (i.e., postvaccine serotype replacement) [64,65], we considered that serotype 3 may have increased in those settings where PCV was introduced. Moreover, variation in effectiveness of the serotype 3 component of PCV13 among CC180 clades may have disproportionally precipitated an increase in Clade II. Here, we find little evidence that PCV use, albeit at a country level, has shaped the serotype 3 CC180 population. However, there are limitations in these comparisons, as they ignore important factors such as phylogenetic population structure (i.e., whether Clade II was already present), age of cases, and details of vaccine roll-out (e.g., timing, targeted age groups, and vaccine coverage). Unfortunately, these data are incomplete for a number of cases/countries, precluding an in-depth analysis of the putative association of PCV with emergence of Clade II. Nonetheless, since serotype 3 is a non-vaccine serotype in regions where PCV-7, PCV-10, and, arguably, PCV-13 has been introduced, removal of vaccine serotypes may have precipitated the increase in serotype 3. Further, as we posit below, phenotypic and genomic variation between Clades I-α and II may have led to a competitive advantage to Clade II, which did not exist prior to vaccine introduction.
Historically, antibiotic resistance among serotype 3 CC180 strains has been low. This is because the previous population was dominated by isolates from Clade I, which is largely devoid of AMR-associated genes and mutations. This may result from the high invasiveness of serotype 3 and its relatively short carriage duration, which in turn reduces bystander antibiotic exposure through treatment of other infections [66]. Here we find chloramphenicol, macrolide, tetracycline, and first step penicillin resistance more frequent among Clade II isolates when compared to Clade I isolates. It is not clear whether these are directly linked to the recent increase in prevalence and emergence in previously unobserved regions. However, increased macrolide usage in North America has previously been associated with increases in macrolide resistant Staphylococcus aureus and nonsusceptible S. pneumoniae [67,68].
In addition to variations in AMR-associated genes, we also find that MGE and polymorphic protein antigens varied among Clades I-α, I-β, and II. The previously described ϕOXC141 was generally absent among Clades I-β and II and has been lost by a North American sub-clade of Clade I-α [54,55]. Temperate bacteriophages may be associated with fitness defects among pneumococci or changes in virulence and competence [69,70]. Whether the presence or absence of ϕOXC141 relates to the relative success of serotype 3 CC180 clades is unclear. However, the absence of ϕOXC141 from Clade II as well as the loss of the phage by a subclade of North American Clade I-α strains remains notable and warrants further investigation. Contributing to genomic variations among clades, pspA, pspC, and SP2194 (ATP-dependent Clp protease) were found to vary. Natural immunity to protein antigens is generated through nasopharyngeal colonization, and this immunity may protect against colonization and invasive disease [27,71]. Protein antigen variants were generally conserved among Clade I-α isolates and varied in Clades I-β and II. As protein antigen variants are serologically distinct, variations in antigen profiles among CC180 clades could impact relative transmissibility and virulence, resulting in differences in carriage duration, transmission, and invasive capacity [27,59,72,73]. For example, in previous comparisons of Family 1 and 2 pspA variants in isogenic strains of serotype 3 (WU2), Family 2 mutants were slightly less virulent and bound less human lactoferrin, impacting nasopharyngeal colonization [74]. Here, we find that Clade II isolates largely possess the Family 1 pspA variant. Further, variants of pspC differentially bind factor H and generate distinct antibody responses, promoting immune escape [75]. Most isolates in Clade I-α possess a pspC variant corresponding to Groups 7/8/9, which have been grouped here because they share the same structural organization [59]. In contrast, Clades I-β and II possess multiple polymorphic pspC variants interspersed throughout the topology of the clades, suggesting that multiple recent recombination events have generated variation in this known recombination hotspot as well as a diversified antigenic profile. Taken together, differences in antigenic profiles add to the immunological variability among CC180 clades. Under a model of immune selection, where novel and beneficial combinations of protein antigens are positively selected, this could have led to an increase in the frequency of Clade II isolates. While the epidemiological implications of these finding require further investigation, together with antibiotic resistance, antigenic variation presents another possible contributor to the success of Clade II.
The population of serotype 3 has been previously thought to be largely clonal, clustering into a group of closely related PFGE patterns corresponding to the Netherlands 3 -31 clone [76]. Further, evolution of CC180 was thought to be driven primarily through nucleotide substitution and estimates of recombination rates based largely on Clade I-α suggested a comparatively lower rate to other PMEN clones [18], mainly dominated by micro-recombination events [77]. However, the expanded global sample of PMEN31 serotype 3 genomes demonstrates in detail how the prevalent lineages have diverged following significant ancestral recombination events. While comparatively rare, recombination continues to shape the CC180 population, as terminal branches in all three clades show evidence of recent recombination events. This inferred clonality likely resulted from the limited sample of type 3 isolates from restricted geographies. Further, while serotype 3 is generally understood to show lower competence than other serotypes based on in vitro studies [18,78], we found that the emergent Clades II and I-β have more evidence of historical recombination events (ρ/Θ) and a greater proportion of macro-recombination events than the previously dominant Clade I-α. The Clade I-α comCDE operon generates an antisense transcript, an observation that could be relevant to the low historical recombination rate of Clade I-α [18]. Amino acid changes in the comCDE, specifically comD, also segregate Clade I-α from I-β, and II. We also found variations in the Clp protease gene, which regulates the comCDE operon, and is therefore associated with competence rates among pneumococci and other bacteria [56,79]. Increased recombination rates in Clade II have the potential to generate a more diverse antigenic profile, which would be beneficial under diversifying selection from the immune system. More recombination also facilitates the acquisition of multi-drug resistance [80,81]; indeed, higher rates of recombination have previously been associated with antimicrobial resistance [82]. Negative frequency-dependent selection (NFDS) has been proposed as a mechanism for maintaining intermediate frequency loci among pneumococci [83], and together with host immunity, may also explain serotype coexistence [84] and vaccine-induced metabolic shifts [85]. One possible explanation for the relative success of Clade II, is that it possessed a favorable repertoire of accessory loci, protein antigen variants, or metabolic loci, which were selected for in the post-vaccine landscape. The exact loci which are acted upon by NFDS are yet unknown; however, it could involve immune selection of protein antigens, antibiotic resistance conferring gene, and bacteriophage predation defence [12,83]. Here, we find variation in all three between Clade I and II.
Because serotype 3 pneumococcal strains possess a mucoid capsule and release significantly more capsular polysaccharide during in vitro growth and infection in mice, it is hypothesized that this soluble polysaccharide absorbs anti-capsular antibody, effectively impeding antibodymediated killing in vitro and in vivo [24,26,52]. We found that Clades I-α and II both shed CPS during growth; however, the amount of CPS released did not significantly differ between the two clades. Another phenotype that has been found to correlate with increased survival due to resistance from phagocytosis by human neutrophils, and contribute to success in nasopharyngeal carriage is low surface charge, measured as zeta potential [25]. We therefore compared this between Clades I-α and II and found no significant difference in charge between clades, suggesting their capsular properties are similar. Where we did find significant phenotypic differences between the clades, it was not easy to relate these differences to the apparent increase in Clade II following PCV13 introduction; we initially hypothesized that antisera against PCV13 and type 3 polysaccharide might be more effective against isolates from Clade I-α, explaining its apparent decline. However, for both Clades I-α and II, neutrophil-mediated opsonophagocytic killing occurred only at high antibody levels. Clade I-α isolates appeared to be resistant to killing by PCV13 antisera, while Clade II isolates were more readily killed. The resistance of Clade I isolates to opsonophagocytic killing is consistent with the low efficacy of PCV13 on serotype 3 invasive disease incidence [10,86], but difficult to link to the recent emergence of Clade II. However, these results should not be considered a direct representation of vaccine effectiveness, because opsonophagocytic killing may be a poor proxy for the ability to colonize and transmit. Moreover, opsonophagocytic killing only occurred at high titers. Our findings are consistent with those of other groups, suggesting that higher antibody concentrations are needed for killing in vitro [52]. They also demonstrate that closely related bacteria with the same capsule do not necessarily behave similarly in the face of vaccination.
Together, our findings further support the involvement of anti-protein antibodies in mediating anti-serotype 3 immunity. Anti-protein antibodies may act in synergy in vivo with anticapsular antibodies to mediate pneumococcal killing. This is an intriguing hypothesis in light of our other findings about the differences in protein variant composition among serotype 3 strains. These differences in protein composition underscore the need for better evaluation of the role that anti-protein antibodies have in mediating anti-pneumococcal protection. Overall, the role of these in recent dynamics of serotype 3 epidemiology requires further study.
Serotype 3 remains an epidemiologically important serotype, continuing to cause invasive disease and increasing in carriage prevalence in North America. Here, we explore multiple epidemiological, genotypic, and phenotypic factors associated with the persistence of serotype 3 in carriage and disease. We found that the recent success of CC180 is related to an increasing clade, which is distinct from the previous population's antigenic composition, antibiotic susceptibility, and competence. Based on OPKA, we also found evidence to support the low-observed efficacy of PCV13 against serotype 3, which was previously dominated by CC180 Clade I-α in Europe and North America. Optimistically, the efficacy of PCV13 against the emergent Clade II may be higher; however, more in vitro and in vivo studies are required to confirm this finding.
Our study is not without limitations. While we base our analysis on the largest collection of extant CC180 genomes, geographic and temporal sampling variations exist. To mitigate coalescent analysis bias, we down-sampled according to published approaches. Additionally, while the clonality and low diversity of Clade I-α may, at first approximation, suggest a more recent origin, HPDs for the TMRCA of Clade I-α and II estimated through coalescent analysis are non-overlapping and correspond to previously published estimates [18]. Further, despite modest sampling from multiple geographies prior to 1999 (n = 34), Clade II isolates were not observed. Our results instead indicate that the higher recombination rate and recent population expansion of Clade II has led to its greater diversity. Lastly, our genomic results may be affected by the accuracy of bioinformatic programs. Wherever possible, results were verified using multiple approaches or manual inspection of the data.
Overall, our analysis emphasizes the need for routine, representative sampling of isolates from disperse geographic regions, including historically under-sampled areas. We also highlight the value of genomics in resolving antigenic and epidemiological variations within a serotype, which may have implications for future vaccine development. As PCV usage and antibiotic consumption expands globally, it is imperative to continue genomic and epidemiological surveillance of pneumococci to detect the emergence of new lineages and monitor changes in clinical presentation and severity, as these data have direct implications for prevention and management of pneumococcal disease.  Int-Tn, ermB, and tetM (bottom). The core genome phylogeny and genealogies are labeled with length of alignment, mean pairwise difference (mpwd), and the ratio of non-synonymous to synonymous mutations (dN/dS). The phylogenies show a clear delineation between two distinct transposons (blue and black tip labels). In the ermB genealogy, two isolates that are macrolide susceptible are colored in green.  Table. Results of Bayesian phylogenetic analysis using Beast 1.8.4 for Clades I-α and II. For each clade, strict (SC) and relaxed (RC) molecular clock models were compared for constant, exponential, and GMRF Skygrid demographic models. Log marginal likelihood estimates (MLE) from path-sampling (PS) and stepping-stone (SS) analysis were used to calculate Bayes Factors for model comparison. Log Bayes Factors (BF) are specified for each molecular clock and demographic model comparison. The date of the most recent common ancestor (TMRCA) and evolutionary rate, scaled in SNPs/site/year are presented for the final model with corresponding highest posterior densities (HPD). (DOCX)