Clinical Clostridium difficile: Clonality and Pathogenicity Locus Diversity

Clostridium difficile infection (CDI) is an important cause of mortality and morbidity in healthcare settings. The major virulence determinants are large clostridial toxins, toxin A (tcdA) and toxin B (tcdB), encoded within the pathogenicity locus (PaLoc). Isolates vary in pathogenicity from hypervirulent PCR-ribotypes 027 and 078 with high mortality, to benign non-toxigenic strains carried asymptomatically. The relative pathogenicity of most toxigenic genotypes is still unclear, but may be influenced by PaLoc genetic variant. This is the largest study of C. difficile molecular epidemiology performed to date, in which a representative collection of recent isolates (n = 1290) from patients with CDI in Oxfordshire, UK, was genotyped by multilocus sequence typing. The population structure was described using NeighborNet and ClonalFrame. Sequence variation within toxin B (tcdB) and its negative regulator (tcdC), was mapped onto the population structure. The 69 Sequence Types (ST) showed evidence for homologous recombination with an effect on genetic diversification four times lower than mutation. Five previously recognised genetic groups or clades persisted, designated 1 to 5, each having a strikingly congruent association with tcdB and tcdC variants. Hypervirulent ST-11 (078) was the only member of clade 5, which was divergent from the other four clades within the MLST loci. However, it was closely related to the other clades within the tcdB and tcdC loci. ST-11 (078) may represent a divergent formerly non-toxigenic strain that acquired the PaLoc (at least) by genetic recombination. This study focused on human clinical isolates collected from a single geographic location, to achieve a uniquely high density of sampling. It sets a baseline of MLST data for future comparative studies investigating genotype virulence potential (using clinical severity data for these isolates), possible reservoirs of human CDI, and the evolutionary origins of hypervirulent strains.


Introduction
Clostridium difficile infection (CDI) is a major concern in healthcare settings worldwide. Symptoms range from mild diarrhoea to life threatening pseudomembranous colitis, with 6% mortality overall, rising to 13.5% in older patients [1]. Individuals may be asymptomatically colonised in the community, or acquire the bacteria nosocomially [2]. Risk factors predisposing colonised patients to develop symptoms include antibiotic treatment and advanced age [3][4][5].
The major C. difficile virulence factors are large clostridial toxins designated toxin A (TcdA) and toxin B (TcdB). TcdA and TcdB share 63% amino acid sequence similarity [6] and four functional domains; a N-terminal catalytic domain, an autocatalytic cysteine protease, a hydrophobic membrane translocation domain and a Cterminal receptor binding domain (RBD) [7,8]. Evidence that TcdB alone is essential for virulence has been provided [9], however, more recent data indicate that both toxins are important [10]. TcdA and TcdB are encoded within the 19.6kb pathogenicity locus (PaLoc), together with three additional genes; tcdC, tcdR and tcdE. PaLoc gene expression is growth phase dependent. During early logarithmic growth, high levels of tcdC and low levels of tcdA, tcdB, and tcdR are transcribed; during stationary phase the converse is true [11][12][13]. It is therefore thought that TcdC and TcdR are negative and positive regulators, respectively, of toxin expression [11,14].
The molecular epidemiology of C. difficile has been studied using many different genotyping methods [15][16][17]. This led to the identification of epidemic and hypervirulent genotypes associated with increased morbidity and mortality. One such strain emerged in 2000-2001 [18] causing large CDI outbreaks with high mortality [19][20][21]. This strain is designated BI by restriction endonuclease typing (REA), NAP1 by pulsed field gel electrophoresis (PFGE), ST-1 by multilocus sequence typing (MLST) and 027 by PCR-ribotyping [18,19,21,22]. The production of toxin in vitro by PCR-ribotype 027 has been described as robust, but not significantly different to non-hypervirulent strains [23], and as 16 to 23-fold higher than non-epidemic strains [24]. In a human gut model that simulates CDI, the duration of cytotoxin production by PCR-ribotype 027 was markedly longer than that of PCR-ribotype 001 (23 versus 13 days), and was associated with increased prevalence of vegetative cells, but peak toxin titres were similar [25]. PCR-ribotype 027 also shows increased sporulation efficiency [23,26].
PCR-ribotype 078 has also been described as hypervirulent since it can cause symptoms of similar severity to 027 [27]. This PCR-ribotype produces less TcdA and TcdB in vitro than 027, but more than other toxinotypes [28]. PCR-ribotype 078 is frequently isolated from livestock [28] and its incidence in human disease appears to be increasing [17,29].
Two characteristics of the PCR-ribotype 027 PaLoc have been proposed to explain its virulence. Firstly, the 027-tcdB-RBD is genetically divergent from other strains, apparently conferring broader cell tropism and more rapid cell entry [30][31][32]. Secondly, the tcdC gene has a single nucleotide deletion causing a frameshift that truncates the protein. This has been postulated to remove log phase repression of toxin expression [18,24,33,34]. PCR-ribotype 078 also encodes a truncated TcdC [34], caused by a single nucleotide substitution creating a stop codon. The precise frequency and distribution of these potential hypervirulencepromoting PaLoc variants within the C. difficile population structure is unclear. This is due to the lack of recent large scale studies assessing simultaneously the clinical C. difficile population structure, and the nucleotide sequences of PaLoc variants.
The C. difficile population structure is clonal [22,35,36], comprising five genetic groups or clades [22] which persist despite homologous recombination [37]. Existing data suggest a congruent relationship between tcdC variant and genotype [34,36,38], and possibly a similar relationship for TcdB-RBD, although these data are more limited [36]. Many genotypes, representing all five clades [22] are currently associated with CDI [17,22,29,39], and data on their relative pathogenicity would assist patient management and infection control. In particular, the incidence of PCRribotype 027 has declined recently in many countries, and the virulence potential of the now endemic PCR-ribotype 027 relative to other endemic genotypes is less clear [5,40].
Our aims were to define the TcdB-RBD and TcdC variants for 1290 recent clinical isolates collected from a large, contemporaneous cohort of consecutive CDI cases, and determine their relationship to the C. difficile population structure defined by MLST. This would facilitate study of the evolutionary mechanisms among C. difficile isolates representing a clearly defined collection of co-circulating strains, as well as the estimation of genotype pathogenic potential based on PaLoc tcdB-RBD and tcdC similarity to known hypervirulent genotypes. The size of the study and density of sampling in a single geographic location provides a baseline of C. difficile MLST data. C. difficile genotypes can vary with host species, geographic location and over time, [41][42][43]. Our data set, together with the inherent inter-laboratory comparability and portability of all MLST data, (http://pubmlst.org/cdifficile) will help facilitate comparative studies to understand the reservoirs of human CDI, its international transmission and the evolutionary origin of hypervirulent strains.

Results
C. difficile has a clonal population structure A total of 69 STs were identified among the 1290 clinical isolates, 36 of which are described for the first time in this study.
The relative abundance of the STs is summarised in Table 1, with additional details on the frequency of tcdB-RBD and tcdC alleles in Table S1. PCR-ribotype data representing each ST are presented in Table 1, Table S3 [22], and one (ST-30) was identified in a separate study of infants (data not shown). All 78 STs were included in the analysis of C. difficile population structure.
The total number of variable nucleotide sites was 127/3501 (3.6%), and amino acids 30/1167 (2.6%). The MLST loci were under strong conservative selection (dN/dS ,1, Table S2) as expected for housekeeping genes. The sequences of each ST were concatenated and analysed using Neighbour-Net [44]. Five clades of closely related isolates were identified, representing deep branches of the phylogenetic tree (Fig. 1A). These clades were described previously [22], and although 36 new STs were identified in this study, they all fell within one of the five clades. The relative positions of the hypervirulent ST-1 (027) in clade 2 and ST-11 (078) in clade 5 are shown in Fig. 1A. Extensive networks were found in the ancestry of clade 1 (Fig. 1A), which suggest either homologous recombination or a lack of information to resolve these branchings. The relationships among the STs on the basis of allelic profile is shown by eBURST [45] (Fig. 1B). This analysis did not indicate the presence of many discrete clonal complexes not apparent by nucleotide sequence-based methods.
The five clades were also reconstructed by ClonalFrame [46], which accounts for the effect of recombination when reconstructing the genealogy (Fig. 1C). ClonalFrame was used to infer the numbers of point mutation and homologous recombination events in the C. difficile population. Recombination occurred approximately ten times less often than mutation (r/h = 0.08 with credibility interval [0.04;0.13]), and introduced approximately four times fewer substitutions than mutation (r/m = 0.25 with credibility interval [0.12;0.42]).

Detection of the Pathogenicity Locus
The PaLoc was detected by PCR using oligonucleotide primers which amplify the tcdB-RBD and tcdC loci ( Fig. 2A). Absence of the PaLoc was confirmed using the lok1/lok3 primer pair which bind chromosomal DNA either side of the ,19.6kb PaLoc ( Fig. 2A) [47]. The lok1/3 PCR amplifies 769 bp in the absence of the PaLoc, and thus identifies non-toxigenic isolates. A negative lok1/ 3 PCR in combination with positive tcdB-RBD and tcdC PCRs confirmed an isolate was toxigenic. A positive lok1/3 PCR and negative tcdB-RBD and tcdC PCRs indicated an isolate was nontoxigenic. All isolates conformed to either of these two patterns confirming that the PaLoc was present only in the previously described genomic location [47].
A total of 18 non-toxigenic isolates with seven STs were identified. Some STs had both toxigenic and non-toxigenic isolates

Genetic variation within the tcdB-RBD
The PaLoc position of the tcdB gene, its functional domains and the region sequenced are summarised in Fig. 2A and B. tcdB-RBD sequences (597nt) were determined for all isolates. A total of 17 different alleles were identified. Each was assigned a number in the order of discovery and the sequences made available at http:// pubmlst.org/cdifficile. The association of tcdB-RBD alleles and clades was congruent (Fig. 3A).
Clade 2 (containing ST-1 [027]) was most heterogeneous in terms of its tcdB-RBD alleles (Fig. 3A), the five tcdB-RBD alleles occurring in various combinations with four STs (Table 1); ST-1 (027), ST-41, ST-67, and one ST published previously ST-32 [22]. All ST-1 (027) isolates (n = 448) contained the expected divergent tcdB-RBD8 sequence [30,31], which clusters with clade 2associated tcdB-RBD13 and tcdB-RBD15 (Fig. 3A). Two clade 2 alleles (tcdB-RBD10 and 16) were located on a separate branch of the neighbour joining tree (Fig. 3A). They appear to have a complex admixed ancestry, with some polymorphism typical of  Table S1, to show the frequency of the tcdB-RBD and tcdC alleles associated with each ST. 1 PCR-ribotypes found in association with each ST. An ST can have more than one ribotype, however, the converse is also true and this, together with the numbers of isolates that were PCR-ribotyped is shown in Table S2 and Fig. S1. doi:10.1371/journal.pone.0019993.t001 clade 2 and clade 3 tcdB-RBD sequences, as well as a number of polymorphisms unique to these two alleles (Fig. 3B).

Genetic variation within the tcdC negative regulator
The PaLoc location of the tcdC gene is shown in Fig. 2A and C. tcdC sequence data (552nt) from the initiation codon to codon 184 of the 233 amino acid protein were determined for all isolates. This sequence spans all previously described truncations and deletions within the dimerization domain [14,33,34]. A total of 26 different alleles were identified (Fig. 4). Each allele was assigned a number in the order of discovery and the sequences were made available at http://pubmlst.org/cdifficile. The 26 different tcdC variants included nine of 15 tcdC alleles described previously [34] and a further 17 alleles unique to the present study. The relationship between the tcdC alleles and the five clades identified in this population was examined using a neighbour joining tree. A mostly congruent association was demonstrated, the one exception being allele tcdC-12 which is found in ST-3 (clade 1), but has a sequence similar to the tcdC of clade 2 (Fig. 4A).
The tcdC variants were categorised and named as follows All of the tcdC alleles in clades 1 and 4 isolates were wild type, lacking premature termination codons. However, two clade 1 alleles contained an 18 nucleotide dimerization domain deletion (Fig. 4). Both were derived from the most abundant tcdC-RBD allele, (WTtcdC-3), but had different 18 nucleotide deletions (Fig. 4, Table 1).
The nucleotide locations of all six D18 sequences identified (Fig. 4) were difficult to define precisely because they occur within repetitive sequences and more than one equally likely sequence alignment could be generated. However, ClustalW2 alignments   showed that D18TcdC-25 was unique in having its D18 displaced by 9 nt relative to the other D18 containing alleles (alignment not shown).
Truncation of the TcdC protein relatively close to the N-terminus occurred in all clade 3 isolates, by the same nucleotide substitution (TAAstop) as seen in ST-11 (078), the only member of clade 5. This truncation was unique to clade 3 and ST-11 (078), which had closely related tcdC genes (Fig. 4A). Clade 3 and ST-11 (078) also had deletions within the untranslated nucleotide sequences of the coiledcoil domain (Fig. 4B); D18 and D54 in clade 3, and D36 in ST-11 (078). The similarity between ST-11 (078) and clade 3 within the two PaLoc loci sequenced ( Fig. 3 and Fig. 4) was surprising, given the high divergence of the ST-11 (078) MLST loci from other known genotypes (Fig. 1). It raises the possibility that the ST-11 (078) PaLoc at least, was acquired by homologous recombination.
Almost all STs occurred with a single tcdB-RBD and tcdC variant; only a few had low frequency variants ( Table 1). The two PaLoc loci did have higher dN/dS values than the housekeeping loci, but this was still much less than 1, and therefore not indicative of diversifying selection (Table S2). Overall, tcdB-RBD and tcdC variants were highly predictive of clade, and in clade 4 and clade 5 also predictive of ST.

Discussion
The population structure of a large (n = 1290), recent collection of clinical C. difficile isolates, representing a population unit of circulating strains, was defined using MLST. The sequences of two loci (PaLoc tcdB-RBD and tcdC) putatively linked to hypervirulence [18,24,30,31,33,34] were determined, mapped onto the population structure, and used to examine the underlying evolutionary mechanisms. Our data confirm the clonal population structure of C. difficile [22,35,36] and demonstrate a largely congruent association between clade and PaLoc tcdB-RBD and tcdC variants. Only occasional deviations from congruence were identified due to recombination events. STs sharing the same PCR-ribotype were in most cases closely related (supported by bootstraps, Fig. S1 and Table S3), further supporting the clonal population structure. These observations are in agreement with previous suggestions based on nucleotide sequences [34,36,38], toxinotyping (a RFLP-PCR based method in which two PCR amplified fragments from the tcdB and tcdA genes undergo restriction digest to give characteristic banding patterns) and PCR-ribotyping [48,49]. However, since MLST data allow the precise phylogenetic relationships among genotypes to be visualised (Fig. 1) the present study demonstrates that specific PaLoc variants (Fig. 4) are cladeassociated. The ability to cluster genetically related isolates may provide greater power in future studies aiming to investigate associations between clinical disease severity and genotype.
The five clades defined by traditional phylogenetic approaches (Fig. 1A, 1B and [22]) were supported by ClonalFrame analysis (Fig. 1C, Fig. S1). ClonalFrame showed that recombination had an effect approximately four times lower than point mutation (r/ m = 0.25 with credibility interval [0.12;0.42]). This is consistent with a previous estimate of r/m = 0.2, [50] also based on MLST data [35]. A significantly higher value of r/m between 0.63 and 1.13 has also been reported in the deep phylogeny of C. difficile based on whole genomes [37]. The authors suggested that this difference may reflect recombination rates that are lower in housekeeping genes than the genome as a whole.
Although clade 1 contained by far the highest number of STs, further work studying additional isolates from diverse sources may identify additional genotypes within the other clades. STs submitted by other laboratories to the MLST database (http:// pubmlst.org/cdifficile) suggest this is the case, the exception currently being clade 5, containing only ST-11 (078). The high frequency and large number of different clade 1 genotypes (Table 1) implies that this clade may be particularly well adapted to humans, and therefore potentially sampled most frequently.
All isolates were cultured from ELISA positive stools (indicating the presence of toxin A and, or toxin B) and screened for the PaLoc by lok 1/3 PCR [47]. Eighteen non-toxigenic isolates were identified suggesting either simultaneous colonisation with a toxigenic strain, or an unreliable false positive ELISA test result, which may occur in as many as 20% of cases. All isolates that contained the Paloc genes tcdB and tcdC were negative for the lok1/3 PCR [47], indicating that despite the high mobility of the C. difficile genome [51], the PaLoc (in this clinical isolate population) remains in the same chromosomal location defined 14 years ago [47]. This, together with the observation that tcdB-RBD and tcdC sequences are largely congruent with clade, may indicate that the PaLoc inserted into the genome once, prior to the divergence of the clades. Subsequent homologous recombination may have imported the divergent tcdB sequences found in clade 2 from another Clostridial species possibly on more than one occasion. Consistent with this, the tcdB of C. difficile strain 8864 is divergent throughout its length (GenBank AJ011301; [52]) and is closely related to both the tcdB-RBD found in ST-1 (027), and the tcdB N-terminal catalytic domain of ST-37 (017, A-B+, clade 4) [53]. TcdB sequences are therefore either 8864-like or CD630-like (ST-54, clade 1, [51]), or mosaics of the two. An alternative explanation for the observed congruence of clade and PaLoc is that the PaLoc inserts in a nucleotide sequence and lineage specific manner, possibly in the form of a clade-specific bacteriophage. The latter is supported by the occurrence of nontoxigenic STs throughout clade 1, and in clade 4 (Fig. 1). Three STs had both toxigenic and non-toxigenic variants (Fig. 1). Interestingly, the eBURST diagram (Fig. 1B) [45] showed that the three STs (ST-3, ST-7, and ST-48) identified in both toxigenic and non-toxigenic form were single locus variants clustering closely together. This may indicate PaLoc instability within a common genetic background.
The tcdB-RBD and tcdC loci of clade 5 ST-11 (078) were closely related to clades 1 to 4 ( Fig. 3 and Fig. 4), in contrast to its MLST loci which were divergent from the other clades (Fig. 1). Furthermore, the PaLoc tcdC of clade 3 and clade 5 ST-11 (078) uniquely shared the same nucleotide substitution that truncates the protein. This raises the possibility that clade 3 may, (like clade 5 hypervirulent ST-11 078) have high virulence potential, a hypothesis that will be tested using clinical severity data collected for these isolates. Clade 3 is associated with CDI, causing 49 cases during the study (3.8%), compared to 27 (2.1%) cases due to . National surveillance data for England show that clade 3 associated PCR-ribotype 023 was endemic in the South during this study period [17], the incidence peaking at ,18% (London region, April to June 2007). PCR-ribotype 023 represented 43 of 2030 (2.1%) UK isolates collected during the 1990s [54] and has been detected in Poland and Finland [55,56].
Truncation of TcdC occurred by two different mechanisms; a single nucleotide deletion (in some clade 2 isolates) and a single nucleotide substitution (common to all members of clades 3 and 5) (Fig. 4). The evolution of this truncation at least twice may indicate evolutionary convergence due to a common selective advantage. These three clades are associated with clinically more severe disease relative to clades 1 and 4 (data not shown).
Three STs in clade 1 (ST-9, ST-10 and ST-51) are of interest as they occur with both wild type tcdC, and a coiled-coil domain D18 (Fig. 4, Table 1). This deletion did not impact on tcdC function in a tcdA-b-glucuronidase reporter fusion constructed in C. perfringens [14]. The naturally occurring paired D18tcdC and wild type tcdC variants we describe could be used to confirm these observations. Our mutants harboured the D18 nt in two different locations, suggesting the nucleotide repeats of the coiled-coil domain are unstable, with the 18 nt deletion arising more than once (Fig. 4).
We intend to test the hypothesis that specific STs and, or PaLoc variants are associated with more or less severe disease, using clinical data collected for this large cohort of CDI cases. Data on the relative pathogenicity of different genotypes would assist patient management, targeting of infection control resources and the identification of emergent hypervirulent strains. This large data set provides a framework for further study of C. difficile population biology, and establishes a baseline against which isolates from different hosts and geographic regions can be compared, to understand the sources and evolutionary origins of C. difficile strains that currently cause infection in humans.

Ethics Statement
This study focused only on characterising C. difficile isolates that were archived on an ongoing basis. As this study did not use any patient data, the research ethics committee advised that ethical approval was not required. were retained, and contained sufficient faecal sample for culture, performed as in [22]. The routinely submitted faecal samples were obtained from both hospital and community patients. The size of the population served is approximately 600,000, which represents around 1% of the UK population. When more than one faecal sample received from a single patient yielded isolates of the same genotype, only the first isolate was included. A total of 1290 isolates were available for study, representing 1217 patients and 1277 episodes of diarrhoea (based on a 14 day de-duplication).

Genotyping
MLST and PCR-ribotyping were performed as described previously [22]. The composition of all PCRs was as before, [22] with additional oligonucleotide primers as follows. Absence of the PaLoc was confirmed using PCR primer pair lok1 and lok3 [22,42] (Fig. 2). The tcdB-RBD fragment was amplified and sequenced using oligonucleotide primer pair tcdB3 59-GTAGTTGGATGGAAR-GATTTAG-39 and tcdB4 59-CATCYAAAGTATTTTGAT-GTGC-39 (712bp amplicon). Amplification conditions were 95uC for 15s, followed by 35 cycles of 94uC for 30 s, 50uC for 40 s, and 72uC for 1 min 10 s, then 72uC for 5 min. The tcdC sequence was amplified and sequenced using primer pair tcdC-F1 59 AATTTT-TAGTCAACTAGTTATTTTAAG-39 (located 75 nt upstream of the tcdC initiation codon) tcdC-R1 59-TATAGTTCCAGCACT-TATACCTC-39 (688 bp amplicon). Amplification conditions were 95uC for 15s, followed by 35 cycles of 94uC for 30 s, 59uC for 30 s, and 72uC for 1 min, then 72uC for 5 min. High throughput nucleotide sequencing was performed as described [22]. MLST or tcdB-RBD and tcdC sequencing of all the isolates giving a newly identified allelic profile (ST) or allele nucleotide sequence was performed at least twice, each time using newly extracted DNA from the isolate to confirm the result.

Phylogenetic Analysis
Manual alignments of nucleotide sequences containing deletions were prepared using BioEdit Sequence Alignment Editor [57], and using the program ClustalW2 (http://www.ebi.ac.uk/Tools/ clustalw2/index.html). Neighbour joining trees were constructed using MEGA version 4 (available from http://www.megasoftware. net/) [58]. Phylogenetic networks were constructed using Neighbour-Net (part of the SplitsTree4 software package, http://www.splitstree.org) [59]. eBURST [45] was used to investigate relationship among STs on the basis of alleleic profiles. ClonalFrame analysis [46] was performed by preparing an extended multi-FASTA file containing one representative of each of the 78 STs. ClonalFrame reconstructs genealogies in a similar fashion to traditional phylogenetic techniques, with the difference that it detects, quantifies and accounts for the effect of homologous recombination. ClonalFrame was run for 100,000 iterations, the first half of which was discarded to allow for convergence. Convergence and mixing were found to be suitable by comparison of four independent runs. Figure S1 Clonal population structure is supported by clustering of STs sharing the same ribotype. ClonalFrame analysis of all 78 STs as shown in Fig. 1C. PCR-ribotypes which occurred with more than one ST (Table S3) [22]. STs occurring with more than one PCR-ribotype and PCR-ribotypes occurring with more than one ST are shown in the table. The following PCR-ribotypes occurred with one ST (