Sequence Analysis of 96 Genomic Regions Identifies Distinct Evolutionary Lineages within CC156, the Largest Streptococcus pneumoniae Clonal Complex in the MLST Database

Multi-Locus Sequence Typing (MLST) of Streptococcus pneumoniae is based on the sequence of seven housekeeping gene fragments. The analysis of MLST allelic profiles by eBURST allows the grouping of genetically related strains into Clonal Complexes (CCs) including those genotypes with a common descent from a predicted ancestor. However, the increasing use of MLST to characterize S. pneumoniae strains has led to the identification of a large number of new Sequence Types (STs) causing the merger of formerly distinct lineages into larger CCs. An example of this is the CC156, displaying a high level of complexity and including strains with allelic profiles differing in all seven of the MLST loci, capsular type and the presence of the Pilus Islet-1 (PI-1). Detailed analysis of the CC156 indicates that the identification of new STs, such as ST4945, induced the merging of formerly distinct clonal complexes. In order to discriminate the strain diversity within CC156, a recently developed typing schema, 96-MLST, was used to analyse 66 strains representative of 41 different STs. Analysis of allelic profiles by hierarchical clustering and a minimum spanning tree identified ten genetically distinct evolutionary lineages. Similar results were obtained by phylogenetic analysis on the concatenated sequences with different methods. The identified lineages are homogenous in capsular type and PI-1 presence. ST4945 strains were unequivocally assigned to one of the lineages. In conclusion, the identification of new STs through an exhaustive analysis of pneumococcal strains from various laboratories has highlighted that potentially unrelated subgroups can be grouped into a single CC by eBURST. The analysis of additional loci, such as those included in the 96-MLST schema, will be necessary to accurately discriminate the clonal evolution of the pneumococcal population.


Introduction
Disease-causing pneumococci represent a phenotypically and genotypically diverse population of strains that cause bacteraemia, meningitis, pneumonia, sinusitis, and acute otitis media in children [1][2][3][4]. Effective epidemiological surveillance, along with the characterization and the classification of the circulating strains, are important tools to understand the evolution and the population dynamics that support the success of specific pneumococcal lineages [5][6][7][8][9]. The capsule, of which there are 94 known types, is the most important pneumococcal virulence factor, as well as the target of current licensed vaccines and the most widely used single parameter for epidemiological typing of pneumococcal strains [10][11][12][13][14].
Within recent years, the accessibility of sequence analysis tools has increased the diffusion of Streptococcus pneumoniae molecular genotyping methods such as Multi Locus Sequence Typing (MLST) (http://spneumoniae.mlst.net/) [15,16]. S. pneumoniae MLST is based on sequence data from standardized fragments of seven housekeeping genes; each unique allele is identified by a numerical ID, and the allelic profile at the seven loci is used to classify bacterial isolates into Sequence Types (STs) [17]. The MLST classification reveals important insights into the geographic spread of successful pathogenic clones and also the emergence of associations between STs and serotypes, combinations that can be traced back to serotype switching events [18][19][20][21][22].
The use of specifically designed algorithms such as eBURST, shows that MLST-related strains can be further grouped into Clonal Complexes (CCs) ideally including only genotypes descending from a common predicted founder [23,24]. This type of analysis allows studying the temporal evolution of lineages with respect to capsular serotype switch events, antibiotic resistance acquisition and geographic distribution [5,18,25]. In addition, isolates belonging to the same ST or CC have been demonstrated to inherit specific genetic traits, like the two pilus encoding islets (PI-1 and PI-2), the pneumococcal pathogenicity island psrP-secY2A2, and pcpA (Pneumococcal choline binding protein A) [26][27][28][29][30][31]. Recently, formerly distinct CCs have merged into larger and extremely heterogeneous CCs due to the extensive analysis of S. pneumoniae strain collections from different geographical regions and the consequent identification of several rare STs. One example is CC156 (the largest clonal complex in the MLST database), whose predicted founder based on eBURST analysis is ST156, although a recent report based on its penicillin susceptibility profile suggests that ST162 is the true ancestral ST [25].
Whether or not CC156 still represents the evolutionary descent from a predicted ancestor or is an example of artificial grouping due to the reduction of the discriminatory power of the eBURST algorithm, is a matter of scientific interest. The application of alternative typing methods with respect to MLST would be of paramount importance to unravel this point. Whole genome sequence analysis has been successfully applied to S. pneumoniae and has provided a clear indication that pneumococcal genomic variability is most likely triggered by homologous recombination events [32,33]. In addition to whole genome sequencing, other molecular typing methods such as multiple-locus variable number tandem repeat analysis (MLVA), multilocus boxB sequence typing (MLBT) and 96-MLST, able to trace capsular switch or other homologous recombination events, have been proposed [15,[34][35][36].
In order to assess whether distinct genetic lineages were present in CC156, we applied 96-MLST to a panel of strains representative of 41 different STs belonging to CC156. The application of the 96-MLST schema allowed for the distinction of ten lineages within CC156, each homogeneous in terms of capsular type and for the presence of clonally inherited genetic traits (such as the presence of PI-1). Noteworthy, strains belonging to ST4945, whose recent discovery had been responsible for the merging of distinct clones into CC156, were unequivocally assigned to one of the lineages. Moreover, two out of the three ST4945 SLV analyzed were assigned to lineages different from ST4945, suggesting that strains in close proximity in the eBURST graphic representation (and differing in only one of the seven MLST alleles) can be significantly different when considering additional loci interspersed in the whole genomic backbone.

Strain Collection
The collection of S. pneumoniae clinical isolates analysed in this study was representative of the allelic (ST) diversity of CC156 strains. The following information on the 66 strains used in this study is listed in Table 1: name, serotype, ST, geographical origin, the number of MLST alleles in common with ST156 (the CC156 eBURST predicted founder) and with ST4945, data source and strain source. Overall, the strains analyzed belong to 41 CC156 STs and are associated with 14 capsular types. The isolate collection was composed of 8 strains for which either complete or draft genomic sequence was available and 58 additional isolates for which new sequence data were generated (see below). The 58 isolates already characterized for serotype and ST were kindly provided by many laboratories (complete list in the ''Acknowledgments'' section).

Genomic DNA Extraction and PI-1 Presence Assessment
Bacteria were grown overnight at 37uC in 5% CO 2 on Tryptic Soy Agar plates (TSA) (Becton Dickinson) supplemented with 10 mg/l colistine, 5 mg/l oxolinic acid and 5% defibrinated sheep blood. Genomic DNA extractions for the 58 isolates were performed by using the Wizard Genomic DNA purification kit following the manufacturer's instructions (Promega). PCR amplifications for 96-MLST were performed on genomic DNA as briefly described below, while the presence of PI-1 and the PI-1 clade were determined as reported elsewhere [27].

96-MLST
The 96-MLST loci and the amplification primers for the 96-MLST schema reported in Tables S1 and S2 were selected and designed as detailed elsewhere [34]. Briefly, the 96 loci were identified within the S. pneumoniae core genes by comparing the available complete and draft genomes of 39 S. pneumoniae invasive strains and selected for being short variable regions (ranging from 847 to 209 bp) flanked by conserved regions. For the 8 complete and draft genomes (Table 1) of S. pneumoniae the nucleotide sequences of the 96 loci were retrieved by in silico analysis; a locus was considered to be absent when a primer aligned against the genome sequence with more than three mismatches to guarantee consistency with the results of the PCR amplifications. For this reason, the two loci SP0180 and SP0181 were not extracted from the SP195 genome.
The 96-MLST loci of the 58 S. pneumoniae isolates were PCR amplified in a microtiter 96-well plate by using a unique amplification condition (55uC annealing temperature, 1 minute elongation) with the primer set reported in Table S2 [34]. The PCR products were then purified with paramagnetic beads (Agentcourt AMPure XP, Beckman Coulter Genomics, USA) and both strands sequenced with the same amplification primers by an ABI 3730xl DNA Analyzer (Applied Biosystems, USA). Chromatogram traces were edited and assembled with Vector NTI Advance 11 (Life Technologies). Consensus sequences were aligned using MUSCLE 3.8.31 [37]. The SP0278, SP1785, SP1909, SP2141 and SP2198 loci were not amplified on 2, 4, 1, 1 and 3 strains respectively. The complete nucleotide sequence dataset is provided in the File S1.

Hierarchical Clustering and Minimum Spanning Tree Analysis
Sequences were converted into allelic profiles (Table S3), by assigning a unique ID number to each allele. When a primer pair did not amplify, the absent locus was assigned the ID number ''00. With this choice, strains sharing the same deletions or divergent sequences in the primer region were considered more similar than what would be by treating these loci as missing data, thus allowing the use of the information in the phylogenetic analysis.
Hierarchical clustering was performed using the package Cluster v1.13.1 [38] of the software package R v2.12.0 (www.rproject.org/). Distances between strains were computed using the function ''Daisy'' with Gower's distance, counting the number of differences between allelic profiles. An agglomerative hierarchical clustering of the data was performed using the function ''Agnes'' with ''average'' (unweighted pair-group average method -UPGMA) method. Support of the clustering was assessed using bootstrap.

Phylogenetic Analysis
For each of the 96 loci the sequences were aligned using MUSCLE [37]. Aligned sequenced where concatenated, and phylogenetic analysis was performed using Mega5 [40] with the Neighbor Joining method [41].

ClonalFrame Analysis
ClonalFrame V1.1 [42] was run on the aligned sequences of the 89 loci that where present in all strains. Seven independent runs of  10 6 iterations were performed, and the first half of each run was discarded. When compared, the distributions of some of the parameters from the seven runs failed to fulfill the convergence requirement of a Gelman and Rubin statistics below 1.2, suggesting that the evolutionary model implemented by Clonal-Frame gives a poor description of these data. The posterior samples of the tree topologies of the seven runs were combined, and a consensus phylogenetic network was generated using SplitsTree v4.10.

Results
The Identification of ST4945 Strains has Been Sufficient to Cause the Merger of Three Independent S. pneumoniae Lineages into One Clonal Complex, CC156 CC156 (predicted founder ST156) was identified as the largest clonal complex in the S. pneumoniae MLST database (accessed on 15th January 2012) by running the eBURST algorithm with the default settings on the complete dataset which, at the time of the analysis, comprised 7146 distinct STs (ST1-ST7160). CC156 presented a complex and heterogeneous structure ( Figure 1A and [25]), which did not change by using other algorithms such as goeBURST [23], encompassing 13.8% (986 STs) of the total STs in the database (as a comparison the second largest CC, CC320, included only 235 STs, 3.3% of the total). Interestingly, based on analyses performed in 2008, CC156 comprises STs which formerly belonged to distinct complexes: CC124, CC146, CC162, CC176, and CC392. These CCs were associated with different serotypes and differed for the presence of PI-1 and for PI-1 clade [27]. As detailed elsewhere [25,27,43], CC162 strains were associated with serotypes 9V and 14 and were PI-1 positive (PI-1 clade I), CC124 and CC392 strains were associated with serotypes 14 and 17F and were PI-1 negative, CC146 strains were associated with serotype 6B and PI-1 positive (clade II) and CC176 strains were generally associated with serotype 6B and 23F and heterogeneous for the PI-1 presence (clade II).
To investigate whether the merger of multiple CCs into a single CC was due to the identification of one or multiple new STs, the eBURST algorithm was iteratively executed by progressively excluding the most recently identified STs from the analysis. This resulted in the observation that ST4945 (a new allelic combination), occupied a central position within the eBURST CC156 graphic representation (Figure 1 and Figure S1), and had been sufficient to induce the merger of three formerly distinct CCs: CC162, CC124 and CC176 into one larger CC ( Figure 1A). At the time of this analysis, only two ST4945, both serotype 17F from two different countries were present in the MLST database. Noteworthy, the CC124 and CC176 identified by excluding ST4945 from the analysis comprised the formerly separated complexes CC124 and CC392, and CC176 and CC146, respectively [27].
To further investigate whether the strains comprising the newly formed CC156 had a common evolutionary descendent (ST156) or whether ST4945 strains contained a combination of genetic alleles from different lineages (as suggested by their MLST profile), we analyzed a panel of 66 representative strains of different STs belonging to CC156 (see materials and methods section, Table 1, Figure 1 and Figure S1). The strains isolated in different countries belonged to 41 STs and shared between zero and six MLST alleles with ST156 ( Figure 1B) despite belonging to the same CC. As shown in Figure S1A, all of the strains displayed the PI-1 distribution expected based on the lineage partitioning present before the introduction of ST4945 (see Figure 1A and [26,27,43]). The two ST4945 strains were also part of the collection as well as single and double locus variant (SLV, DLV) strains of ST4945 ( Figure S1B). Figure 2. Hierarchical clustering performed on the 96-MLST alleles identifies ten genetically distinct evolutionary lineages (a-j) within the 66 CC156 strains analyzed. Sequences were converted into allelic profiles assigning a unique ID number to each allele. Hierarchical clustering was performed using the package Cluster v1.13.1. Distances between strains were computed using the function ''Daisy'' with Gower's distance, counting the number of differences between allelic profiles. An agglomerative hierarchical clustering of the data was performed using the function ''Agnes'' with ''average'' (unweighted pair-group average method -UPGMA) method. The ten lineages identified (a-j) are indicated by coloured boxes, and numbers represent the bootstrap support. The STs of all the strains are indicated in the coloured bar. doi:10.1371/journal.pone.0061003.g002 Hierarchical Clustering of the 96-MLST Alleles Identified Ten Genetically Distinct Evolutionary Lineages within the CC156 Strain Panel The 66 strains were typed by using the 96-MLST schema [34]. Following amplification and sequencing, the sequences were converted into allelic profiles (Table S3). The minimum number of distinct alleles was identified in the SP0841 locus (6 alleles), while the maximum number of alleles was identified in the SP334 and SP2194 loci (28 alleles). By performing hierarchical clustering of the 66 strains using the number of loci with different alleles as a measure of their genetic distance it was possible to identify ten genetically distinct lineages supported by high values of bootstrap (Table 1, Figure 2 and Figure S2A). As shown in Figure 2, strains with the same ST were always assigned to the same lineage, while being SLV was not predictive for the assignment to the same lineage. Indeed, ST392, ST789 and ST4404 were all SLVs (in different MLST alleles) of ST4945 strains but only ST392 was assigned to the ST4945 lineage; the SLVs ST273 and ST385 were assigned to different lineages as well as the SLVs ST2218 and ST176; finally, ST361 and ST2218 (in a DLV relationship) both SLV of ST171 were allocated into the same lineage, different from that of ST171 (Figure 2). In addition, similar lineages could be identified also by using ClonalFrame (Figure S2B). Small differences (e.g. the position of strains 08809744 and 08802447) with respect to the results obtained with hierarchical clustering could be attributed to the fact that, despite having run 7 independent simulations for a total of 7*10 6 iterations, incomplete sampling of the tree topologies was achieved. Remarkably, a phylogenetic analysis performed with the same dataset by aligning the 96 loci concatenated sequences, identified the same lineage pattern ( Figure S3).
As expected, by computing a hierarchical clustering analysis with the MLST allelic profiles for the same set of 41 STs, SLV STs were grouped in the same lineage ( Figure S4). This was also true when the phylogenetic analysis was performed with the sequences of the seven MLST loci ( Figure S5). However, for the 7-MLST analysis the groupings obtained with the two methods were not exactly the same (i.e. for the two SLV STs 4966 and 5420).

ST4945 can be Assigned to One of the Identified Lineages by the 96-MLST Allelic Profile Analysis
The partitioning of CC156 into ten distinct lineages by hierarchical clustering was further evaluated by performing a Minimum Spanning Tree (MST) analysis of the 96-MLST alleles (Figure 3), and by creating a visual diagram of the allelic assortment within and across the identified lineages for both the 7 and the 96-MLST profiles ( Figure 4). As shown in Figure 3, by applying a threshold of 75 loci (i.e. by cutting those links in the MST diagram that connected strains differing by more than 75/96 loci), seven distinct lineages could be identified. Notably, these lineages corresponded to those obtained with the hierarchical clustering analysis, although the strains belonging to lineages ''a'', ''b'', ''c'', ''e'' and ''f'' were grouped by the MST analysis in two larger groups, ''a+b+c'' and ''e+f''. Indeed, the lineages ''a'' and ''b'', the lineages ''b'' and ''c'' and the lineages ''e'' and ''f'' were connected in the MST by strains differing in 56, 60 and 50 loci, respectively. In addition, the two ST4945 strains were unequivocally assigned to lineage ''h'' and the ST392 strain (SLV of ST4945 in the MLST classification) was the closest to the ST4945 strains according to 96-MLST, with 25 different alleles.
In order to visualize whether the allelic differences within and among lineages were concentrated in specific regions of the chromosome, and thus likely attributable to single recombination events, the strains were partitioned based on the lineages identified by hierarchical clustering analysis (Figure 4, columns). The alleles of the 7-MLST and 96-MLST loci (Figure 4, rows) were then colour coded by assigning the same colour to identical alleles within a locus. Since the rows in Figure 4 are ordered according to the position in the chromosome blocks of consecutive loci showing a colour pattern discordant with the rest of the chromosome or with other lineages are indicative of localized intra or interchromosomal differences, respectively. Overall, the distinction among genetic lineages was not due to differences present in specific hyper-variable regions of the chromosome. In fact, the differences were dispersed among the 96 loci and, apart from a few exceptions, each lineage contained a specific repertoire of alleles. Interestingly, lineages ''d'', ''e'' and ''h'', comprising ST4945 strains and SLV and DLV strains of ST4945 (black and blue arrows in Figure 4) shared several alleles in the loci probed by the 7-MLST typing schema, but were clearly distinct at the level of the 96-MLST loci. Besides, lineages ''a'', ''b'' and ''c'' were the most heterogeneous, presented several unique loci and numerous alleles in common (thus justifying the fact they were classified into a unique cluster by MST and the lower support of the branching separating them in the hierarchical clustering and in the consensus network obtained from the posterior sampling of the tree topologies generated by ClonalFrame). A similar situation was shared by lineages ''e'' and ''f'', although these two lineages presented a lower number of unique alleles.
With the exception of lineages ''a'' and ''b'', all of the strains belonging to the same lineage were also closely related in the eBURST graphic visualization of CC156 ( Figure 5). In detail, ST2218, spatially separated from the other lineage ''a'' strains, is both SLV of lineage ''a'' ST172 and of lineage ''b'' ST176, but had 70 and 31 96-MLST loci in common with the closest ST172 strain 08b09744 (lineage ''a'') and ST176 strain PB011 (lineage ''b''), respectively. Besides, ST5613, ST4966 and ST5420 were separated from the other lineage ''b'' strains. Noteworthy, based on MST analysis, strains belonging to these three STs were closer to one another than to other strains in the collection, being the ST171 strain BG02112 the closest to the ST4966 strain 1681 (46/ 96 alleles in common).
In addition, as shown in Figure 5, the majority of the lineages were homogeneous for the presence/absence of PI-1 and, when PI-1 was present, strains within the same lineage carried, (as a confirmation of their phylogenetic proximity) the same PI-1 clade.

Discussion
The implementation of increased regional surveillance combined with molecular typing methods has allowed for the identification of several successful S. pneumoniae clones with a higher invasive disease potential and an ability to spread across different geographical regions. These clones are variably associated with antibiotic resistance and some have been reported to rapidly evolve over time [5,9,16,18,44].
The genetic characterization of S. pneumoniae strains has contributed to the recent progress in pneumococcal biology; however, the genetic traits that allow for the success of specific clones and those responsible for the diversity observed in the S. pneumoniae population are still not completely identified. Out of the several typing methods available to characterize the pneumococcus, MLST is the most widely used for its ability to discriminate bacterial strains. In the MLST framework, relatedness among STs is inferred by methods of cluster reconstruction, or by simple models of clonal expansion and diversification to infer patterns of evolutionary descent, as done by eBURST. However, analyses of this nature must be undertaken with caution due to the potential for recombination events to obscure the evolutionary history of linked groups of strains. By reshuffling the MLST loci, recombination can produce combinations of alleles that cause the merger of unrelated lineages of clonal descent into large, heterogeneous CCs. The probability of the occurrence of these events increases when bacterial collections expand and large numbers of new STs are identified leading to a reduction of the discriminatory power and practical utility of the eBURST algorithm.
An example of this event is clonal complex CC156, which includes a large and heterogeneous group of strains that in many cases differ in all MLST loci, but nevertheless are connected by a continuous path of SLVs. In this report we provide evidence that the identification of a new ST (ST4945) was sufficient to induce the merger of formerly distinct CCs (here at least three) into one single clonal complex. Interestingly, as reviewed by Feil et al [24], nine of the 27 recognized PMEN clones that have contributed to the increase of antimicrobial resistance worldwide (Pneumococcal Molecular Epidemiology Network) [45] are included in this single CC.
In order to discriminate pneumococcal strains within this newly formed CC156, we used a recently developed typing schema based on the sequencing of 96 variable loci belonging to the core genome of S. pneumoniae [34]. We found that CC156 can be partitioned into ten genetically and evolutionary distinct lineages homogenous for capsular serotypes and for the presence of PI-1, a genomic region reported to be clonally inherited [26,27,43]. Notably, the identified lineages correspond to further partitioning of distinct clonal complexes existing before the identification of ST4945, thus suggesting that these complexes might have originated by artificial grouping due to STs over-sampling. As a demonstration of the higher discriminatory power of 96-MLST, hierarchical clustering, ClonalFrame and phylogenetic analysis of the 96-MLST data resulted in the same grouping of strains, whereas in the case of MLST, probably due to the lower number of probed loci and the different grade of variability of the analysed loci, they did not.
To further support the existence of distinct lineages within CC156, we provide evidence that the diversification of the identified lineages is not due to single recombination events occurring at the level of specific genomic regions, but rather by general sequence variability dispersed along the bacterial chromosome. ST4945 strains were unambiguously assigned to one of the identified lineages (containing also some SLV and DLV of ST4945), suggesting that ST4945 could represent an example of multiple recombination events occurring at the level of MLST loci.
In conclusion, exhaustive MLST typing of large collections of pneumococcal strains has led to the identification of new STs and to the reduction of the discriminatory power of the classical eBURST approach. The analysis of additional loci (such as those included in the 96-MLST schema or of the complete genome) will allow for the reconstruction of the clonal structure and increase the ability to infer evolutionary relationships within the pneumococcal population. Figure S1 Graphic representation of CC156 by e-BURST. A) CC156 is heterogeneous for the presence of PI-1. B) 20 out of the 41 CC156 STs analyzed have three or less than three alleles in common with ST4945. MLST database was accessed on 15h January 2012 and CC156 visualized using eBURST (e-BURST algorithm was run on a dataset comprising all the STs in the database represented once). A) PI-1 presence and PI-1 clade analysis was assessed on all the STs analysed. The STs analysed in this study are highlighted and colour coded based on PI-1 presence as indicated in the Figure. B) The STs analysed in this study are highlighted and colour coded based on the number of 7-MLST alleles in common with ST4945 (colour coding is indicated in the Figure).  Figure S4 Hierarchical clustering performed on the 7-MLST alleles of the 41 CC156 STs analyzed. Hierarchical clustering was performed using the package Cluster v1.13.1. Distances between strains were computed using the function ''Daisy'' with Gower's distance, counting the number of differences between allelic profiles. An agglomerative hierarchical clustering of the data was performed using the function ''Agnes'' with ''average'' (unweighted pair-group average method -UPGMA) method. (TIF) Figure S5 NJ phylogenetic tree of the 41 CC156 STs analyzed in this study based on the concatenated sequences of the seven MLST loci. (TIF)

Supporting Information
Table S1 Description of the 96-MLST loci set. ID locus name, short description, locus length, coordinates of start and stop on the TIGR4 genome, and number of alleles identified in this study are reported for each locus.