Table 1.
Assembled C. difficile genomes from the NCBI SRA and associated statistics.
Fig 1.
Clustering of TcdA and TcdB sequences derived from NCBI GenBank and SRA into subtypes.
(A) Hierarchical clustering of TcdA sequences, split into 8 groups. (B) Neighbor-joining phylogenetic tree of representative sequences of each TcdA subtype. (C) Percentage identities between representative sequences. (D) Hierarchical clustering of TcdB sequences, split into 14 groups. (E) Neighbor-joining phylogenetic tree of representative sequences of each TcdB subtype. (F) Percentage identities between representative sequences. Hierarchical clustering was performed using the hclust() function in R, and cluster definitions were selected based on strong within-cluster sequence similarities and weak between-cluster similarities, as demonstrated visually and quantitatively. The reference strains (VPI 10463 and strain 630) are associated with TcdA group A1 and TcdB group B1. The hypervirulent ribotype 027 strains such as R12087 and R20291 are associated with TcdA group A2 and TcdB group B2. Also included are the homologs of TcdA and TcdB (TcsH and TcsL, respectively) from P. sordellii, which expectedly exhibit the highest divergence from other groups. The datasets include TcdA and TcdB sequences from the NCBI GenBank as well as additional sequences assembled from the SRA.
Fig 2.
Toxin subtypes across the C. difficile phylogeny and occurrence of subtypes in a clinical CDI cohort.
(A) TcdA (inner ring) and TcdB (outer ring) subtypes mapped onto a tree of 1934 C. difficile genomes. The tree is a maximum likelihood phylogeny of NCBI-derived C. difficile genomes based on 14,194 genome-wide SNPs (see Methods). Lineages corresponding to previously identified C. difficile PaLoc clades (1–5) are labeled numerically. Selected clinically relevant strains are shown on the tree, with hypervirulent/epidemic outbreak strains indicated by stars. Asterisks indicate lineages without toxin genes. (B) Frequency of toxin subtypes detected in 1,934 representative, complete C. difficile genomes from NCBI/GenBank. A total of 1,640 (84.8%) C. difficile strains contained TcdA and/or TcdB, while 294 (15.2%) were toxin deficient. (c) Frequency of toxin subtypes detected in a CDI clinical cohort from Brigham and Women's Hospital (BWH). The total dataset contained 351 C. difficile genomes derived from infected patients. Of these, 289 (82.3%) contained toxin genes, and 62 (17.7%) were toxin deficient.
Fig 3.
Evolutionary diversification of TcdB by intragenic recombination and domain shuffling.
(A) Visualization of amino acid variation patterns in TcdB using a newly developed haplotype coloring algorithm (HaploColor). The visualization shows patterns of amino acid variation across the TcdB alignment. In this algorithm, the first sequence (B1.1) is assigned a distinct color, and all other sequences are colored the same color where they match this first sequence. Then, the process is repeated using a second sequence (B7.1) as the new reference, and so on. This reveals multiple colored segments indicative of common ancestry (identity by descent). Mosaic patterns are indicative of intragenic recombination. (B) Phylogenetic trees of TcdB based on individual domains. Each domain tree can be subdivided into two types (labeled 1 and 2), which allows each subtype to be described based on its domain composition (C). This reveals that TcdB subtypes are composed of domains with variable evolutionary histories, indicative of domain shuffling and intragenic recombination. (D) Evolutionary model depicting relationships between subtypes and putative recombination events. Here, TcdB split early into two main groups (i and ii). Subtype B2 likely originated by a recombination event fusing an ancestral type i and type ii toxin. B9 likely originated from recombination between B1 and B2, B3 from recombination between B1 and a type ii toxin, and B8 from recombination between B5 and a type ii toxin.
Fig 4.
Conservation and functional variation across TcdB subtypes.
(A) Frequency of amino acid variants across all positions of TcdB. The height of the bar indicates the number of unique TcdB sequences that contain a substitution relative to the classical TcdB1 (B1.1) sequence from strain 630 and VPI10463. Below this is a plot of amino acid variation for key functional regions including the binding sites for the frizzled receptor (FZD) and the antibodies (E3, PA41, and bezlotoxumab). The alignment is colored gray for residues that match the common amino acid found in B1.1, and variants are colored blue (darkest blue = most common variant). E3 and PA41 binding sites are highly conserved, whereas FZD and bezlotoxumab binding sites are highly variable. FZD and bezlotoxumab variants also co-occur with each other. (B) Evolutionary conservation mapped to the protein structure of full length TcdB based on PDB 6OQ5 [65]. Eight highly conserved surface patches are indicated. Center residues within each surface patch are indicated in bold font.