Genomics of Clostridium taeniosporum, an organism which forms endospores with ribbon-like appendages

Clostridium taeniosporum, a non-pathogenic anaerobe closely related to the C. botulinum Group II members, was isolated from Crimean lake silt about 60 years ago. Its endospores are surrounded by an encasement layer which forms a trunk at one spore pole to which about 12–14 large, ribbon-like appendages are attached. The genome consists of one 3,264,813 bp, circular chromosome (with 26.6% GC) and three plasmids. The chromosome contains 2,892 potential protein coding sequences: 2,124 have specific functions, 147 have general functions, 228 are conserved but without known function and 393 are hypothetical based on the fact that no statistically significant orthologs were found. The chromosome also contains 101 genes for stable RNAs, including 7 rRNA clusters. Over 84% of the protein coding sequences and 96% of the stable RNA coding regions are oriented in the same direction as replication. The three known appendage genes are located within a single cluster with five other genes, the protein products of which are closely related, in terms of sequence, to the known appendage proteins. The relatedness of the deduced protein products suggests that all or some of the closely related genes might code for minor appendage proteins or assembly factors. The appendage genes might be unique among the known clostridia; no statistically significant orthologs were found within other clostridial genomes for which sequence data are available. The C. taeniosporum chromosome contains two functional prophages, one Siphoviridae and one Myoviridae, and one defective prophage. Three plasmids of 5.9, 69.7 and 163.1 Kbp are present. These data are expected to contribute to future studies of developmental, structural and evolutionary biology and to potential industrial applications of this organism.

Introduction
Bacterial endospore appendages are both common and highly diverse in structure, including ribbons, pili, feathers, brushes, tubules and swords (reviewed in [1,2]). Moreover, their formation is highly variable even among closely related organisms. Different strains of the same species might or might not form appendages and different structural types can be formed by different strains of the same species [1]. Of special interest are the spore appendages of Clostridium taeniosporum. This organism, a Gram-positive, non-pathogenic anaerobe isolated from Crimean lake silt, is unique because its spores are surrounded by a thick "encasement" layer which forms a trunk at one spore pole from which about 12–14 large, flat, ribbon-like appendages emanate [3,4,5,6]. The appendages, about 4.5 μm in length, 0.50 μm in width and 30 nm thick, are composed of smaller tennis-racket-like complexes (fibrils), with heads about 5 nm in diameter attached to tails about 1–2 nm in diameter and 40 nm in length, arranged in parallel rows with the heads forming one surface of the appendage [4,6]. The smaller complexes are composed primarily of three proteins, two molecules of nearly identical 29 kDa paralogs and one molecule of a 37 kDa glycoprotein [6]. The 29 kDa proteins are thought to form the heads and the glycoprotein, which contains a collagen-like domain, is thought to fold back on itself into a triple-stranded, right-handed cylinder to form the tails [7,8,9,10]. (The apparent difference between the 30-nm thickness of the appendages and the 40-nm length of the fibril tails is likely the result of different methods of preparation or bending of the fibril tails in the appendages.) Synthesized late in sporulation, the ribbons are coiled into a stalk-like structure attached to the spore pole near the mother cell mid-point and are so large that the stalk occupies most of the mother cell interior [4]. Each ribbon contains about 50,000 to 100,000 complexes and the complete set of appendages is assembled from about 600,000 to 1,200,000 molecules of each of the principal component proteins.
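The molecule counts quoted above follow from simple multiplication. The sketch below assumes one molecule of each principal protein per fibril and the lower bound of 12 ribbons per spore; the function name `molecules_per_spore` is illustrative and does not come from the source.

```python
def molecules_per_spore(fibrils_per_ribbon: int, ribbons: int = 12) -> int:
    """Molecules of one principal protein needed for a full appendage set.

    Assumption for this sketch: each fibril carries exactly one molecule
    of the protein in question, and a spore carries `ribbons` appendages.
    """
    return fibrils_per_ribbon * ribbons


low = molecules_per_spore(50_000)    # lower estimate of fibrils per ribbon
high = molecules_per_spore(100_000)  # upper estimate of fibrils per ribbon
print(low, high)  # prints: 600000 1200000
```

With 14 ribbons instead of 12, the same calculation gives a proportionally larger requirement.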
This organism is interesting for many reasons. First, study of the appendage function might contribute to microbial ecology. Perhaps the appendages serve as adhesive organelles to maintain spores in favorable anaerobic environments or perhaps to facilitate dispersal into new habitats. Second, fibril and appendage assembly studies should contribute to structural biology. Third, developmental biology problems of how appendages are positioned on one spore pole and how their size and shape are determined might be approached. Fourth, the evolutionary relationships of the non-toxigenic C. taeniosporum to its closest relatives, the neurotoxigenic C. botulinum Group II members [11], should be instructive. Fifth, the potential use of spores or purified appendages as surface display hosts in vaccine production, for drug delivery into hypoxic environments, and in nanobiotechnological applications should be explored. Finally, Gonchikov [12] has proposed that eukaryotic cells could have arisen from a clostridial cell which forms spores with ribbon-like appendages engulfing a euryarchaeon in an endosymbiotic process. To provide the basis for study of these and other interesting problems, the genome of C. taeniosporum has been sequenced and annotated.

Genome general features
The C. taeniosporum chromosome is a circle of 3,264,813 bp (Fig 1) with a total of 2,892 potential protein coding regions covering 84.03% of the chromosome. Of these, 2,271 can be assigned specific (2,124) or general (147) functions (Table 1). The remaining 621 have unknown functions, of which 393 are hypothetical genes, based on the fact that database searches did not reveal a match with a cutoff E value of 10⁻⁵ or less [14], and might be unique to C. taeniosporum. A total of 62 genes encode transposases (10 in the IS256 family [15]) or other proteins related to mobile elements (S1 Table) and are included in the Table 1 Replication/Repair/Recombination functional category. Included also are 101 stable RNA genes, among them seven rRNA gene clusters and 78 tRNA genes (S2 Table). Although some clostridia have selenocysteine tRNA genes [16], C. taeniosporum apparently has neither the tRNA-Sec gene nor the sel operon (discussed below).
Fig 1. The C. taeniosporum spore and chromosome. The spore was observed by scanning electron microscopy as described [6]; the background was blackened by Photoshop. Photographs of other spores have been published [6,11]. From the outside, circle 1 represents the chromosome in Kbp. Circles 2 and 3 represent potential protein coding sequences transcribed clockwise and counterclockwise in shades of green and red, respectively. The different shades of green and of red are assigned randomly to coding sequences; therefore, some adjacent, but otherwise unrelated, coding sequences have the same shades. Circle 4 includes blue and orange bars representing genes related to mobile elements transcribed clockwise and counterclockwise, respectively. Circle 4 also includes three prophages (labeled CtØ1, CtdØ2 (defective) and CtØ3). Circle 5 contains green and red bars to represent rRNA and tRNA genes transcribed clockwise and counterclockwise, respectively. Circle 6 (yellow background) represents GC percentage; the outermost and innermost edges of the yellow circle represent 50 and 20% GC, respectively; the red line is the C. taeniosporum average, 26.6%. Circle 7 (green and red backgrounds) shows GC skew [(G-C)/(G+C)] from +0.55 (outermost edge of green circle) to -0.55 (innermost edge of red circle). Locations of oriC and the spore appendage genes (App) are indicated. The map was generated by Circos 0.56 [13]. Bp 1 is the first bp of the first DnaA box of oriC. GC percentage was plotted every 5,000 bp; GC skew was measured over 10,000 bp windows resampled every 5,000 bp. CpG Islands 1.1 did not detect genomic islands.
The chromosome is composed of 26.6% GC, typical of clostridia [17,18], with tight distribution around the average, except for the seven ribosomal RNA gene clusters in which the GC percentage is markedly higher. The putative origin of replication, oriC, identified by (1) the similarity of its sequence to origins of other Gram-positive bacteria, (2) GC skew and (3) the direction of transcription of individual genes [19,20,21,22], is proposed to consist of two untranslated DnaA Box clusters bracketing the dnaA gene. A similar region of the Bacillus subtilis chromosome, even with the central dnaA gene deleted, is an autonomous replicating sequence [23]. Bacterial leading strands often contain more G's than C's, a fact which is useful in identifying origins and termini [24,25]. C. taeniosporum replichores 1 and 2 are clearly marked by almost entirely positive and negative GC skew values with averages of +0.254 and -0.238, respectively (Fig 1). Replichore 1, replicated clockwise, is also transcribed predominantly clockwise (87.3% of the CDSes); replichore 2, replicated counterclockwise, is also transcribed predominantly counterclockwise (81.7% of the CDSes). All seven ribosomal RNA gene clusters and 74 of the 78 tRNA genes are also oriented with the replication direction.
This preferential orientation of genes with the replication direction [26, 27, 28] has the advantage of avoiding head-on collisions of replication and transcription complexes [29,30]. Single copies of the appendage genes are located in one cluster. Three prophages are located within the chromosome and three extrachromosomal plasmids totaling 238.7 Kbp are also present. C. taeniosporum is among the relatively rare clostridia which neither synthesize selenoproteins nor incorporate selenium into 2-selenouridine in tRNAs [31].
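The GC percentage and the GC skew, (G-C)/(G+C), used to delimit the replichores can be computed with a simple sliding window. The sketch below is a minimal illustration, not the pipeline used for Fig 1; the function names are hypothetical, the defaults mirror the 10,000 bp windows resampled every 5,000 bp quoted for the figure, and the sequence is treated as linear (no wrap-around at the end of the circular chromosome).

```python
def gc_percent(seq: str) -> float:
    """Percentage of G and C bases in a sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq) if seq else 0.0


def gc_skew(seq: str) -> float:
    """GC skew, (G - C) / (G + C); 0.0 if the window has no G or C."""
    seq = seq.upper()
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def sliding_gc_skew(seq: str, window: int = 10_000, step: int = 5_000):
    """Yield (start, skew) for each full window along a linear sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_skew(seq[start:start + window])


# Toy sequence: a G-rich half followed by a C-rich half, so the skew
# changes sign between the first and last windows.
toy = "GGGC" * 5 + "CCCG" * 5
for start, skew in sliding_gc_skew(toy, window=20, step=10):
    print(start, round(skew, 3))
```

On a real chromosome, a sustained sign change in the windowed skew is the feature that suggests an origin or terminus; averaging the per-window values over each replichore gives summary figures of the kind reported above.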

Replication origin
The putative oriC was identified by the orientation of genes in two replichores, base composition asymmetry, the presence of DnaA boxes and the dnaA gene and the locations of genes frequently found near known origins [19]. oriC is on a 9.4 Kbp region which contains rnpA, rpmH, oriCI, dnaA, oriCII, dnaN, recF, orf68, recF, orf87, gyrB and gyrA, similar to the gene organization of the origins of Gram-positive organisms [20,21,22]. oriC is proposed to consist of two untranslated DnaA box clusters bracketing the dnaA gene. oriCI is an untranslated 420 bp sequence containing ten putative DnaA binding sites which match the consensus (TTATCCACA for low G+C Gram-positive Firmicutes) [32,33] in at least 8 of the 9 positions and also two direct repeats (Fig 2). oriCII is also untranslated and consists of 234 bp containing five DnaA boxes (at least 8 matches to the consensus) and an AT-rich, potential DNA Unwinding Element (50 AT pairs within a 53 bp region) near the 3' end (Fig 2). The presence of direct repeats and the DNA Unwinding Element is also characteristic of origins. The nucleotide sequence is very similar to that of the closest relative, C. botulinum B strain Eklund 17B (GenBank Accession NC_010674), except that one oriCI DnaA box in the latter organism matches the consensus in 7, rather than 8, positions (Fig 2). Although the oriCII region alone is capable of autonomous replication in some organisms [34,35,36], autonomous replication of an oriC plasmid in B. subtilis requires both oriCI and oriCII, but not the dnaA gene itself [23].
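The criterion used for the DnaA boxes, at least 8 matches to the 9-position consensus TTATCCACA, can be expressed as a short scan over both strands. This is a minimal sketch under that stated criterion; the function names and the toy sequence are illustrative assumptions, not the authors' software or the actual oriC sequence.

```python
CONSENSUS = "TTATCCACA"  # DnaA box consensus quoted in the text
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def matches(site: str, consensus: str = CONSENSUS) -> int:
    """Number of positions at which a site agrees with the consensus."""
    return sum(a == b for a, b in zip(site, consensus))


def find_dnaa_boxes(seq: str, min_matches: int = 8):
    """Return (position, strand, site, n_matches) for candidate boxes.

    Sites on the minus strand are reported as read 5'->3' on that strand;
    the position is always the leftmost plus-strand coordinate.
    """
    seq = seq.upper()
    k = len(CONSENSUS)
    hits = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        for strand, site in (("+", window), ("-", revcomp(window))):
            n = matches(site)
            if n >= min_matches:
                hits.append((i, strand, site, n))
    return hits


# Toy example: one perfect box on the plus strand and one 8/9 box on
# the minus strand.
toy = "GGG" + "TTATCCACA" + "CCC" + revcomp("TTATACACA") + "GG"
for hit in find_dnaa_boxes(toy):
    print(hit)
```

Lowering `min_matches` to 7 would also report boxes such as the weaker oriCI site described for C. botulinum Eklund 17B, at the cost of more chance matches in an AT-rich genome.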

Sporulation in the clostridia
The overall process of forming spores under control of the sigma cascade started by phosphorylated Spo0A is basically similar in Bacillus subtilis and in the clostridia, but there are many differences, especially in the control of sporulation, as reviewed recently by Al-Hinai et al. [37]. The clostridia are a very diverse group. Collins et al. [38] described about twenty clusters of the clostridia and Yutin and Galperin [39] have proposed that the Clostridia should include also the Negativicutes (Gram-positive bacilli which form spores and have evolved to form also Gram-negative envelopes and phenotypes). Even within the C. botulinum species, there are four groups; the members of each group are closely related to each other, but distantly related to members of the other three groups [40]. C. taeniosporum is a non-toxigenic member of the C. botulinum Group II [41]. Given such diversity, it is not surprising that major patterns of controlling spore formation differ within the clostridia and between clostridia and B. subtilis. First, nutrient deprivation is the signal to sporulate in B. subtilis and in some clostridia [37], but in the solventogenic clostridia, the accumulation of organic acids and lower pH are thought to initiate spore formation even in the presence of excess nutrients [42]. Second, the first observable morphological change in clostridia is a shift from uniform bacilli to the swollen, rounded clostridial cell form [43], a form not observed in Bacillus [37]. A third major difference between Bacillus and the clostridia is the mechanism of Spo0A activation. In B. subtilis, histidine kinases phosphorylate phosphorelay proteins which transfer the phosphate to Spo0A to start the sporulation sigma cascade (σF, σE, σG, σK) [37]. In the clostridia, Spo0A is phosphorylated directly by orphan histidine kinases (i.e., those without cognate response regulators) without participation of phosphorelay proteins. Moreover, the details of the Spo0A activation differ among the different clostridial species in the number and identity of the histidine kinases [37]. Fourth is the role of σK. In B. subtilis, σK functions late in the mother cell [44]. In several clostridial species, σK functions both late in the mother cell and also prior to stage II (asymmetric septation) [37]. It is required for Spo0A synthesis in C. acetobutylicum and at least one strain of C. botulinum [45,46]. Fifth, the control of expression of the σH gene appears to differ between B. subtilis and at least one Clostridium. In the former, σH is involved in the transition from exponential to stationary phase [47] and in the expression of a histidine kinase which initiates the Spo0A phosphorylation pathway [48]. σH expression, from both σA- and σH-dependent promoters, is then up-regulated by activated Spo0A [49]. In C. acetobutylicum, the sigH gene is expressed from a σA-dependent promoter and its expression level is higher throughout the culture cycle than that of the general transcription sigma factor σA [50]. Finally, the nature of spore appendages varies from species to species in both genera. In C. taeniosporum, sporulation begins even in the presence of excess nutrients, making it typical of most clostridia. At least 89 C. taeniosporum genes code for spore components, including appendage proteins, or for regulatory factors.

Appendage genes and proteins
The spore appendages, composed of small, tennis racket-like fibrils arranged in parallel rows [3,4,5,6], are composed principally of three proteins, two 29 kDa isoforms designated P29a and P29b and a glycoprotein of 37.5 kDa (deglycosylated) containing a collagen-like region and designated GP85 [6]. The genes for these and closely related proteins are single copy, adjacent, located on a 9.6 Kbp region of the chromosome in the order: CRD1, HYPO1, HYPO2, HYPO3, CRD2, P29a, GP85, P29b, and CL2 (Fig 3) and are all transcribed in the same direction. The CRD symbols indicate that the (deduced) protein products are conserved and contain repeat domains but without known functions. The HYPO symbols indicate the genes are hypothetical coding sequences without significant homology to known proteins. The CL2 symbol indicates that the deduced protein also contains a collagen-like region.
All the deduced protein products of these genes, with the exception of HYPO1, are highly related. First, many of them share extensive sequence similarity (Fig 3; S3 Table) with one or more other proteins of this group. CRD1 and CRD2 (564 and 413 residues, respectively) share 399-residue regions which are 36.6% identical and 68.7% similar. Both P29a and P29b contain 269 residues, 87% of which are identical. Moreover, P29a and P29b share extensive similarity to both CRD1 and CRD2; 225 residues of P29a and P29b (of the 269 residue total) are about 22% identical to a similar region of the CRD1 protein and 258 of their residues are about 29% identical to a similar region of the CRD2 protein. HYPO2 and HYPO3 (158 and 160 residues, respectively) are 34% identical and 65% similar over 140 residues, which constitute most of their lengths, and both share extensive similarity to regions near the C-termini of both the CRD1 and CRD2 proteins. Finally, the GP85 and CL2 proteins are identical over the first 39 residues and collagen-like regions cover 239 and 129 residues in GP85 and CL2, respectively.
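Percent identity figures such as those above are computed from pairwise alignments. The sketch below shows one common convention, identical residues divided by the number of alignment columns; this denominator is an assumption, since the source does not state which convention or alignment program produced its values, and the toy sequences are invented.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two gapped, equal-length aligned sequences.

    Assumption for this sketch: the denominator is the number of
    alignment columns, skipping columns where both sequences have a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    identical = sum(1 for a, b in columns if a == b)
    return 100.0 * identical / len(columns) if columns else 0.0


# Toy alignment: 4 identical columns out of 6.
print(round(percent_identity("MKT-LV", "MRTALV"), 1))  # prints: 66.7
```

Percent similarity additionally counts conservative substitutions and therefore needs a substitution matrix, which is outside the scope of this sketch.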
Second, six of the nine proteins contain internal repeats, ranging from the shortest 65- to the longest 127-residue repeats in the CL2 and CRD1 proteins, respectively (Fig 3; S3 Table). Third, four of the nine proteins contain the domain of unknown function, DUF11, within repeat regions. This conserved domain [51,52] (http://www.ebi.ac.uk/interpro/entry/IPR001434) contains about 76 residues and is present in cell envelope proteins, often within internal repeats, of unknown function in a wide range of distantly related prokaryotes, including Archaea. Examples include three spore proteins of the Bacillus cereus group (CrdA, CrdB and CrdC) [53], a Chlamydia trachomatis major outer membrane complex protein [54] and an archaeal Methanosarcina mazei cell surface protein [55].
Fourth, collagen-like regions are present in two proteins. Collagens form connective tissues in higher organisms and contain left-handed helices of repeating GXY sequences which wind around a central axis forming right-handed, triple helical, rod-like structures [7,8]. The GXY repeats often include proline and hydroxyproline as the X and Y residues in higher organisms. Some bacterial and phage structural proteins also contain collagen-like regions of GXY repeats which form stable triple helices, although they lack hydroxyproline [56]. The surface proteins of Streptococcus pyogenes Scl1 and 2 have lollipop shapes with the collagenous regions folding back on themselves to form the rods [56]. The Bacillus anthracis exosporium BclA protein contains a collagen-like region and is similar in shape to the C. taeniosporum appendage fibrils [57]. Some phage structural components also include proteins with collagen-like regions [58,59].
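Collagen-like regions of the kind described here can be located by searching a protein sequence for runs of consecutive G-X-Y triplets. The sketch below uses a regular expression for that purpose; the function name, the minimum of five consecutive triplets and the toy sequence are assumptions made for illustration, not values or methods taken from the source.

```python
import re


def find_gxy_runs(protein: str, min_repeats: int = 5):
    """Return (start, end, n_triplets) for runs of consecutive G-X-Y triplets.

    `min_repeats` is an arbitrary threshold chosen for this sketch;
    X and Y may be any residue written as a capital letter.
    """
    pattern = re.compile(r"(?:G[A-Z]{2}){%d,}" % min_repeats)
    return [(m.start(), m.end(), (m.end() - m.start()) // 3)
            for m in pattern.finditer(protein.upper())]


# Toy protein with six consecutive G-X-Y triplets after a short leader.
toy = "MKT" + "GPPGAPGEK" * 2 + "LLV"
print(find_gxy_runs(toy))  # prints: [(3, 21, 6)]
```

A run length in residues, such as the 239 and 129 residues reported for GP85 and CL2, corresponds to `end - start` in this output.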
The appendage genes are highly expressed; each cell must synthesize at least 600,000 molecules each of P29a, P29b and GP85 late in sporulation to assemble about 50,000 fibrils for each of the 12 appendages. There are five strong candidates for late mother cell sigma K-dependent promoters in the appendage gene region. All five match the consensus sigma K promoter sequence [a/cACa/c N16 CATA N3 TA] [60,61] perfectly (or with one mismatch), all have the consensus spacing and all contain the most highly conserved AC of the -35 region (Fig 3).
Two of the putative sigma K promoters are located in the intergenic regions upstream of the P29a-GP85 and the P29b-CL2 genes and these pairs likely form operons. The putative sigma K promoter upstream of the P29a gene matches the consensus perfectly and the one upstream of the P29b gene has only one mismatch. A putative sigma A-dependent promoter upstream of the P29b and CL2 genes suggests that this operon is expressed in vegetative growth, as well as late in sporulation. Additionally, three putative sigma K promoters are located within the CRD1, HYPO1 and HYPO3 reading frames (Fig 3). Although promoters usually are located within intergenic regions, sigma A-dependent promoters can be found within reading frames [62]. Two well characterized, low level, constitutively expressed promoters are located within the Escherichia coli trp and ilv reading frames [63,64]. Additional sequences which match the sigma K consensus with two or three mismatches are located within this appendage gene region, but have not been labeled as putative promoters.
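The sigma K consensus quoted above, [a/cACa/c N16 CATA N3 TA], can be turned into a position template and scanned along a sequence while counting mismatches at the constrained positions. The sketch below is a hypothetical illustration of that idea, scanning one strand only; the names and the toy sequences are assumptions, and the source does not say how its candidate promoters were actually located.

```python
# One entry per position: a string of allowed bases, or None for N.
SIGK_TEMPLATE = (["AC", "A", "C", "AC"] + [None] * 16
                 + ["C", "A", "T", "A"] + [None] * 3 + ["T", "A"])


def sigk_mismatches(window: str) -> int:
    """Count mismatches to the consensus at the constrained positions."""
    return sum(1 for base, allowed in zip(window, SIGK_TEMPLATE)
               if allowed is not None and base not in allowed)


def find_sigk_candidates(seq: str, max_mismatches: int = 1):
    """Return (start, window, mismatches) for windows on the given strand."""
    seq = seq.upper()
    width = len(SIGK_TEMPLATE)  # 29 positions including both spacers
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        n = sigk_mismatches(window)
        if n <= max_mismatches:
            hits.append((start, window, n))
    return hits


# Toy sequences: a perfect match and a variant with one -35 mismatch.
perfect = "AACA" + "T" * 16 + "CATA" + "GGG" + "TA"
one_off = "AATA" + perfect[4:]
print(sigk_mismatches(perfect), sigk_mismatches(one_off))  # prints: 0 1
```

Raising `max_mismatches` to 2 or 3 would also report the weaker matches that the text mentions but does not label as putative promoters.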

Prophages
C. taeniosporum contains two complete prophages and one defective prophage. Prophage CtØ1 consists of 37,424 bp, thirty-six potential coding sequences (10 of which code for hypothetical proteins, mostly conserved among phage proteins) and putative attachment sites (S4 Table). Its GC content is 30.19%. The second prophage, CtdØ2, is likely to be defective; no attachment sites, phage head protein genes or site-specific integrase genes were detected. It consists of only 22,430 bp with 26.26% GC content. Among the 27 potential coding sequences, 10 code for hypothetical, but mostly conserved, proteins (S5 Table). The third prophage, CtØ3, has 28.6% GC and consists of 43,915 bp, 54 potential coding sequences (28 hypothetical, mostly conserved) and potential attachment sites (S6 Table). All the prophage orfs are listed in the Phage-related functional category in Table 1 and the prophage orfs related to recombination/integration are listed also in S1 Table (Genes associated with mobile genetic elements).
Both the CtØ1 and CtØ3 prophages are functional. After incubation of C. taeniosporum in medium containing mitomycin C, phage particles were observed by electron microscopy in the concentrated culture fluid (Fig 4). The phage observed in greater numbers has a longer, flexible, non-contractile tail (typical of the Siphoviridae); the phage with a shorter, contractile tail was observed much less frequently and is typical of the Myoviridae [65,66]. Based on the number of nucleotides in the tail tape measure protein genes, CtØ3 is likely to be the Siphoviridae; CtØ1 the Myoviridae [67,68]. PCR amplification and sequencing of DNA fragments from the concentrated phage particles confirmed the presence of CtØ3 sequences.
All three prophages are closely related to known clostridial and bacillus phages (Fig 5). CtØ1 and CtdØ2 are very similar to a clade of 16 clostridial phages. CtØ1 is most closely related to Clostridium phage phiCD38-2 (NC_015568); they are 59.2% identical in more than 37 Kbp of sequence. The defective, 22 Kbp CtdØ2 is most closely related to a portion of the 185.7 Kbp Clostridium phage c-st (NC_007581); it is 54.7% identical to the 9.5 to 31.5 Kbp region of that phage.
Fig 5. The nucleotide sequences of the 62 phages known to infect the Clostridium or Bacillus genera (and for which complete nucleotide sequences were available) were subjected to a MAFFT multiple alignment [69]. A neighbor-joining tree [70] was constructed (Geneious Pro v.7.1.6) and bootstrap values, expressed in percentage based on 1,000 repetitions, are shown next to each group. The bar represents 0.3 change per nucleotide site.

Selenium metabolism
Selenium has three major activities in prokaryotes, incorporation into selenoproteins, incorporation into 2-selenouridine-containing tRNAs, and action as a cofactor in certain molybdenum-containing hydroxylases. The first two of these functions depend on the synthesis of selenophosphate by selenophosphate synthetase, the selD gene product [31,71]. The selenoproteins, common among anaerobes, contain selenocysteine and are required for growth in amino acid media to catalyze Stickland reactions and harvest metabolic energy by the coupled anaerobic oxidation and reduction of amino acid pairs resulting in ATP production by substrate level phosphorylation [72,73,74,75]. The incorporation of selenocysteine into protein requires the products of the selA, selB and selC genes (reviewed in Böck [76]), which are not present in C. taeniosporum. The replacement of sulfur in 2-thiouridine in the tRNAs which contain that modification [77] requires 2-selenouridine synthase, the product of the ybbB gene [78], also apparently missing from the C. taeniosporum genome.
Therefore, this organism is among the relatively rare SelD orphans [31], organisms which have the selD gene and presumably synthesize selenophosphate by action of selenophosphate synthetase, but neither synthesize selenoproteins nor incorporate selenium into 2-selenouridine in tRNAs. SelD orphans account for about 5% of SelD-containing organisms [31]. This raises two questions. First, what is the function of selenophosphate synthetase in SelD orphans? Selenium is a labile cofactor in some molybdenum-containing hydroxylases, including xanthine dehydrogenase [79,80,81] and a purine hydroxylase [82], although the structure of the Se is apparently not known [31]. C. taeniosporum contains two copies of a Se-dependent xanthine dehydrogenase (xdh) gene and at least four other genes coding for Se metabolism-linked proteins. Perhaps SelD is required for incorporation of Se as a labile cofactor in these or other proteins. Second, how does this anaerobe harvest energy in the absence of selenoproteins to catalyze Stickland reactions? Perhaps some oxidoreductases involved in energy metabolism also use non-covalently linked Se as a cofactor. For example, the selenium- and molybdenum-containing nicotinic acid hydroxylase of Clostridium barkeri requires a labile form of Se which is directly coordinated with molybdenum [83,84].

C. taeniosporum plasmids
C. taeniosporum contains three plasmids: pCt1, pCt2 and pCt3. pCt1 consists of 5,894 bp and contains genes for a replication protein, a relaxase, a mobilization protein (MobC homolog) and three genes of unknown function (some contain conserved domains) (Fig 6; S7 Table). The putative replication protein is highly similar in sequence to nine plasmid and chromosomal replication factors of a wide range of organisms, including other clostridia, Geobacillus, Pseudomonas, Aeromonas and Yersinia. It is most closely related to the putative replication proteins of the Clostridium perfringens plasmid pSM101B (YP_699929; NC_008264) and of the host Clostridium perfringens SM101 chromosome (YP_697960; NC_008262 [85]) (69% identical over 399 residues to the plasmid protein and 69% identical over 335 residues to the chromosomal gene protein) [85]. It is 42% identical over 395 residues to the Geobacillus stearothermophilus plasmid pGS18 putative replication protein (YP_001716004; NC_010420 [86]). The presence of the potential relaxase and MobC genes might indicate that this plasmid could be mobilized for conjugal transfer.
The pCt1 replicon, present on the 2.4 Kbp fragment containing the 3' end of the relaxase, the replication protein gene rep, the intergenic region downstream of rep and a portion of the peptidase gene, is sufficient for replication in Bacillus subtilis. Plasmid pCt1-2200, which consists of that 2.4 Kbp pCt1 replicon linked to a 1.2 Kbp chloramphenicol-resistance (CM-R) determinant from pAT4, transformed B. subtilis to chloramphenicol-resistance. pCt1-2200 was extracted from the B. subtilis transformants and its structure confirmed. Therefore, the replication gene and the adjacent nucleotides also contain the plasmid replication origin.
pCt2 consists of 69,744 bp and 62 potential orfs, many of which code for likely useful proteins. Genes of a type I restriction/modification system [87] include those for two M and two S subunits; all located within an 8.2 Kbp region. A cytosine-specific DNA methylase gene is present; this enzyme could be involved also in restriction/modification. Four replication genes are present on this plasmid. One likely codes for a plasmid replication factor; the deduced protein is 32% identical over more than 400 residues to the pCt1 replication protein and more than 80% identical (over 437 residues) to five C. beijerinckii and C. botulinum replication proteins. Other close relatives include the Geobacillus stearothermophilus pGS18 and the C. perfringens pSM101B plasmid replication proteins and the replication protein encoded by the C. perfringens strain SM101 chromosome. The other three copies of replication genes on pCt2 are significantly similar to the chromosomal dnaD; two of the DnaD proteins are 59% identical over 403 residues to DnaD encoded by the chromosome. DnaD is required for DNA replication initiation and re-initiation in B. subtilis and probably also in other low GC Gram-positive organisms [88]. Other potentially useful pCt2 genes include those which code for bacteriocin synthesis and immunity, quorum sensing, signal transduction proteins, a sigma 70 gene, Soj (a sporulation initiation inhibitor), transporter/antiporter pairs and three putative drug resistance determinants (kanamycin, bacitracin and a multidrug MATE transporter). A putative altruism determinant is present on this plasmid. The abortive infection (Abi) protein is highly related (E-value 7.0e-113) to known factors which, after phage infection, stop progeny phage production but, in the process, kill the infected host, thereby protecting the un-infected cells [89].
pCt3 consists of 163,055 bp and 154 potential genes. Among these are five genes encoding thiamine biosynthesis enzymes on a 4.7 Kbp region, genes for iron and cadmium translocating systems, seven genes encoding transcription regulators, genes for a type III restriction/modification system [90], and five genes for potential drug resistance. pCt3 also contains two CRISPR-like regions. One consists of about 500 bp of nine identical 30-mers repeated directly, separating variable regions of 34-37 bp. The second consists of about 700 bp of thirteen 30-mers also repeated directly and separating variable regions of 34-37 bp. The direct repeats in the two regions differ in only one of 30 bp. Although orfs on both sides of both repeat regions could code for Cas proteins, none of them is orthologous to known Cas proteins (reviewed by Barrangou [91]). Of special interest are components of toxin-antitoxin systems [92,93] which serve to stabilize the plasmid presence. Potential toxin genes include those for a Fic/DOC family protein [94,95] and the Zeta protein [96]. There is also a potential prevent-host-death antitoxin gene [92,93,94].
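Arrays like the two CRISPR-like regions described above, identical 30-mers repeated directly and separated by variable regions of 34-37 bp, can be found with a straightforward scan. The sketch below is an illustrative assumption about how such a search might be written (the function name, the `min_repeats` threshold and the toy sequence are all invented) and is not a substitute for a dedicated CRISPR finder.

```python
def find_repeat_arrays(seq, repeat_len=30, spacer_range=(34, 37), min_repeats=3):
    """Find arrays of an identical k-mer repeated directly.

    Consecutive copies must be separated by a spacer whose length lies
    within `spacer_range` (inclusive). Returns (repeat, start_positions).
    """
    seq = seq.upper()
    arrays = []
    i = 0
    while i + repeat_len <= len(seq):
        unit = seq[i:i + repeat_len]
        positions = [i]
        pos = i
        while True:
            nxt = None
            # Try each allowed spacer length after the current copy.
            for gap in range(spacer_range[0], spacer_range[1] + 1):
                j = pos + repeat_len + gap
                if seq[j:j + repeat_len] == unit:
                    nxt = j
                    break
            if nxt is None:
                break
            positions.append(nxt)
            pos = nxt
        if len(positions) >= min_repeats:
            arrays.append((unit, positions))
            i = positions[-1] + repeat_len  # resume after the array
        else:
            i += 1
    return arrays


# Toy array: four copies of a 30-mer with spacers of 34, 35 and 36 bp.
repeat = "ACGTTGCA" * 3 + "GGATCC"
spacers = ["AT" * 16 + "CA", "CG" * 17 + "T", "TA" * 17 + "GC"]
toy = ("GGGG" + repeat + spacers[0] + repeat + spacers[1]
       + repeat + spacers[2] + repeat + "GGGG")
print(find_repeat_arrays(toy))
```

Because the two pCt3 repeat families differ at one of 30 positions, an exact-match scan like this one would report them as two separate arrays; tolerating a mismatch would require comparing copies by Hamming distance instead of equality.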

Firmicutes and the origin of eukaryotic cells
Theories on the formation of eukaryotic cells can be divided into two major groups, endosymbiotic and autogenous. The former supposes that the eukaryotic nucleus was formed from one prokaryotic cell incorporated by another thereby generating both nucleus and cytoplasm; the latter that differentiation of nucleus and cytoplasm occurred by stepwise changes within a single lineage (reviewed by Martin [97] and Baum [98]). Gonchikov [12] has proposed that a eukaryotic cell could have been formed from an anaerobic clostridial cell which formed spores with appendages and a euryarchaeon by an endosymbiotic process. In this model, the clostridial mother cell cytoplasmic membrane erroneously engulfed a euryarchaeon cell during the engulfment stage of sporulation; the euryarchaeon became the eukaryotic nucleus and the spore appendages became the microtubular mitotic apparatus.

Conclusions
The C. taeniosporum genome consists of a single circular chromosome of 3.26 Mbp, including two prophages and one defective prophage, plus three plasmids and includes numerous genes which code for proteins related to mobile elements, all suggesting that this organism has undergone many genetic exchanges. The three known appendage protein genes are single copy, which is surprising given the huge number of protein molecules needed for assembly of all twelve appendages, and are located in one 9.6 Kbp region of the chromosome along with five other closely related protein genes. The relatedness of the proteins and the proximity of their genes suggest that all those gene products could be involved in appendage production and assembly. Structural and developmental biological studies of the appendages including the mechanism by which they are attached to one spore pole, should, indeed, be very informative. Although C. taeniosporum is thought to be nonpathogenic, it evolved from a common ancestor of the C. botulinum Group II members [11], suggesting that more detailed study of C. botulinum and C. taeniosporum phylogeny and ecology would be useful.

Materials and methods
Strains, culture conditions and plasmid
C. taeniosporum strain 1/k was grown in modified CDC anaerobe medium (tryptic soy broth (30 g/l), yeast extract (5 g/l), NaCl (5 g/l), hemin (5 mg/l), vitamin K1 (10 mg/l), and glucose (5 g/l), pH 7.4, with agar (1.5%) as needed) [99] at 30 °C under an atmosphere of 85% nitrogen, 10% hydrogen and 5% carbon dioxide in a Forma model 1025 anaerobic chamber. All solutions were reduced for 24 hr before use. Escherichia coli strain JM109 was grown on LB medium modified to contain 5 g NaCl/l; Bacillus subtilis strain SMY [100] was grown on LB medium. Ampicillin was added to 100 μg/ml and chloramphenicol to 30 μg/ml for selecting drug-resistant transformants of E. coli; chloramphenicol-resistant transformants of B. subtilis were selected on LB medium with 5 μg chloramphenicol/ml. pAT4, constructed by Charles Stewart and provided by Mary Harrison, carries the chloramphenicol-resistance gene of pC194 (NC_002013) [101].

Molecular techniques
Chromosomal DNA was extracted from exponentially growing cells with the Puregene Genomic DNA Purification kit (Gentra Systems, Minneapolis, MN). Plasmids were purified with a QIAprep Kit (QIAGEN Inc., Valencia, CA 91355). PCR reactions were conducted in a MasterCycler Personal (Eppendorf AG, Hamburg, Germany). Reagents included Taq DNA polymerase from Roche (Branford, CT) and deoxyribonucleotides from Sigma-Aldrich (St Louis, MO) for short sequences, and the RangerMix from Bioline (Taunton, MA) for amplification of fragments longer than 6 kbp. Transformations were performed by standard methods [102,103].

Genome sequencing and assembly
The C. taeniosporum genome was initially sequenced with 454 Life Sciences (Branford, CT) technology and the reads were assembled with the Roche Newbler assembler program version 2.7 (Branford, CT) into 104 large contigs (average length of 33,000 bp). These larger contigs were further assembled with paired-end data into 18 scaffolds with an average length of 192,000 bp, although a total of 73 gaps remained within the scaffolds. Additional data to close the gaps and connect the scaffolds were obtained by PCR, to generate fragments spanning the gaps and linking scaffold ends, followed by Sanger sequencing [104]. The assembly editing program, Consed version 22.0 [105], using the Autofinisher parameter, designed 203 primers and suggested 94 potential pairings; sequencing the resulting products closed 58 of the 73 gaps. An additional 38 primer pairs were designed manually and used to amplify fragments, the sequences of which closed the remaining gaps within scaffolds and joined the 18 scaffolds into 4. The largest was the 3.26 Mbp circular chromosome; the remaining 3 were circular plasmids.
The genome was then sequenced by the Illumina method (Illumina, Inc., San Diego, CA) and the reads assembled into contigs, as indicated above. The Illumina data and the Sanger sequences were mapped to the 454 sequence by Geneious to create a consensus. The discrepancies between the Illumina data and the 454 data were resolved in favor of the Illumina data, which have a lower error rate [106], except in those cases in which the discrepancies were in regions covered by both 454 and Sanger sequencing. In those cases, the 454/Sanger data were favored. Overall, there were 63 corrections to the 3.26 Mbp chromosome (47 single-base deletions, 15 single-base additions and one base substitution). There were 6 changes in one plasmid and 1 change in another. The C. taeniosporum chromosome contains 3,264,813 bp; the plasmids contain 5,984, 69,744 and 163,055 bp.
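The resolution rule described above can be written as a small decision function. The following is only a minimal sketch of that rule, not the procedure actually run in Geneious: the `Discrepancy` record, its field names and the use of "-" for a missing base are illustrative assumptions, and the example positions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Discrepancy:
    """One site where the Illumina and 454 calls disagree (hypothetical record)."""
    position: int          # coordinate on the 454 reference (illustrative)
    call_454: str          # base call in the 454 assembly, "-" for no base
    call_illumina: str     # base call from the Illumina reads, "-" for no base
    sanger_covered: bool   # True if Sanger sequencing also covers this region


def resolve(d: Discrepancy) -> str:
    """Return the call kept in the consensus under the rule stated in the text."""
    if d.sanger_covered:
        # Region covered by both 454 and Sanger data: the 454/Sanger call is favored.
        return d.call_454
    # Otherwise the Illumina call is favored.
    return d.call_illumina


# Tally of chromosome corrections as reported in the text.
chromosome_corrections = {"deletion": 47, "addition": 15, "substitution": 1}
total_corrections = sum(chromosome_corrections.values())

# Replicon sizes in bp as reported in the text.
replicon_sizes = [3_264_813, 5_984, 69_744, 163_055]
total_genome_bp = sum(replicon_sizes)
```

Summing the reported counts reproduces the 63 chromosome corrections, and the four replicons together come to 3,503,596 bp.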

Coding sequence annotation
Initially, annotation of the chromosome was accomplished in four stages. The first stage was location of potential protein coding regions and identification of stable RNA coding sequences with RAST (http://rast.nmpdr.org/) [107] and the Institute for Genomic Sciences Annotation Engine (http://ae.igs.umaryland.edu/egi/index.cgi). Some individual (deduced) proteins were identified by Blast searches [108,109,110]. HMMER 3.1 (http://www.hmmer.org) was used for the homology searches, and matches with an e-value greater than 1e-5 [14] were discarded. For each CDS, the annotation was chosen in the order of preference CbBO, bactNOG and Pfam. That is, if no significant match was found in the CbBO database, the bactNOG database was searched, and finally the Pfam database. The automated annotation was moderately curated manually.
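The database preference can be illustrated with a short sketch. This is an assumption-laden illustration of the selection logic only, not the pipeline that was run: the function name, the `hits` dictionary layout and the example descriptions are hypothetical, and parsing of HMMER output is omitted.

```python
from typing import Dict, Optional, Tuple

E_VALUE_CUTOFF = 1e-5                             # matches above this are discarded
DATABASE_PRIORITY = ("CbBO", "bactNOG", "Pfam")   # order of preference from the text

Hit = Tuple[str, float]  # (annotation description, e-value); hypothetical layout


def choose_annotation(hits: Dict[str, Hit]) -> Optional[Tuple[str, str]]:
    """Pick the annotation for one CDS from its best hit per database.

    Returns (database, description) for the first database in priority order
    with a significant match, or None if no database has one.
    """
    for database in DATABASE_PRIORITY:
        hit = hits.get(database)
        if hit is None:
            continue
        description, e_value = hit
        if e_value <= E_VALUE_CUTOFF:
            return database, description
    return None


# Hypothetical CDS: the CbBO match is not significant, so bactNOG is used
# even though the Pfam match has a smaller e-value.
example = choose_annotation({
    "CbBO": ("weak match", 1e-2),
    "bactNOG": ("ABC transporter permease", 1e-30),
    "Pfam": ("ABC transporter domain", 1e-40),
})
```

Note that the order of the databases, not the size of the e-value, decides between two significant matches; a CDS with no significant match in any of the three databases is left unannotated by this sketch.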
Searches of the CbBO (ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/Clostridium_botulinum_B1_Okra_uid59147/NC_010516.ptt) and bactNOG databases also assigned CDSes into functional categories of clusters of orthologous groups (COGs); assignments were reviewed and edited manually with preference given to the CbBO assignment. The functional category list of prokaryotic proteins described by Tatusov et al. [111] was extended to include Phage-related proteins, Sporulation/control/appendages and Drug resistance/bacterial toxins (Table 1).
After the genome sequence was corrected by giving preference to the Illumina data, the annotation was updated by GenBank, effective July 13, 2017, with the NCBI Prokaryotic Genome Annotation Pipeline, and the manual curation was repeated. The annotation uses GeneMarkS+, which incorporates both protein alignments and statistical predictions and is an extension of GeneMarkS [112,113]. The update benefits from a combination of changes made in the corrected sequence and routine improvements to the annotation pipeline.

Prophage induction and electron microscopy
Prophage induction was accomplished by adding 5 μg/ml mitomycin C to exponentially growing cultures (absorbance 0.1) and incubating for four hours. The cultures were centrifuged at 24,000 g for 1 hr at 4˚C and the supernatant was filtered through 0.45 μm pore diameter Millipore filters. Polyethylene glycol 6000 was added to 10% (w/v) and dissolved. The preparation was incubated at 4˚C for 60 min and centrifuged at 8000 g for 10 min at 4˚C. The precipitated particles were resuspended in 0.01 times the original culture volume of deionized water, in place of the SM buffer used by Oakey and Owens [119]. The phages were observed by transmission electron microscopy. Ten microliters of 2% uranyl acetate were deposited on 200 mesh Formvar/Carbon coated copper grids. After 30 sec, 10 μl of phage preparation were mixed with the stain and, after a further 30 sec, the grids were gently blotted with Whatman paper and allowed to dry for 2-3 min. The grids were observed with an FEI Tecnai Spirit BioTwin transmission electron microscope operated at 80 kV.

Accession numbers
Sequences of the chromosome and the plasmids pCt1, pCt2 and pCt3 have been deposited in GenBank with Accession Numbers CP017253, CP017254, CP017255 and CP017256, respectively.

Construction of pCt1-2200
A 2,397 bp fragment of pCt1 carrying a portion of the relaxase gene, the rep gene, the intergenic region downstream of rep and a portion of the amidopeptidase gene was amplified by PCR (forward primer, pCT1oriF1, 5' GCAACTTAGAGAAGGCGAAAACCT; reverse primer, pCTori3p, 5' GGTGGTAAAAACTCAGGCAAAATATCC) and cloned into pGem-T Easy Vector (Promega Corp., Madison, WI 53711), selecting for ampicillin resistance in E. coli, to generate pCt1-2010. A 1,197 bp fragment of pAT4 carrying the chloramphenicol-resistance (CM-R) determinant was amplified with primers constructed to contain ApaI, AatII and PstI sites upstream of the CM-R gene and an SphI site downstream (forward primer, pC194CMRF1, 5' AGAGGAGGGCCCGACGTCCTGCAGGCGCTTAAAACCAGTCATACCA; reverse primer, pC194CMRR1, 5' AGAGGAGCATGCAGCCGACCATTCGACAAGTT). The amplified fragment was cut with ApaI and SphI and cloned into the pGem-T Easy Vector polylinker, generating pCt1-2011. The pGem-T Easy Vector region was deleted from pCt1-2011 by cutting with PstI on both sides, ligating the remaining fragment and transforming into B. subtilis strain SMY. The resulting plasmid, pCt1-2200, consists of only the pCt1 replicon and the CM-R gene.
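The restriction sites engineered into the CM-R primers can be checked by a simple string search. The sketch below is illustrative only: it uses the standard recognition sequences of the four enzymes, and it assumes that the hyphen printed inside the forward primer is a line-break artifact and not part of the sequence. The function name is hypothetical.

```python
# Recognition sequences (5'->3'); all four are palindromic, so searching one
# strand is sufficient.
SITES = {
    "ApaI": "GGGCCC",
    "AatII": "GACGTC",
    "PstI": "CTGCAG",
    "SphI": "GCATGC",
}

# Primers from the text; the hyphen in the printed forward primer is assumed
# to be a line-break artifact and is omitted here.
PC194CMRF1 = "AGAGGAGGGCCCGACGTCCTGCAGGCGCTTAAAACCAGTCATACCA"
PC194CMRR1 = "AGAGGAGCATGCAGCCGACCATTCGACAAGTT"


def find_sites(primer: str) -> dict:
    """Map each enzyme whose site occurs in the primer to its 0-based start index."""
    return {
        enzyme: primer.find(site)
        for enzyme, site in SITES.items()
        if site in primer
    }


forward_sites = find_sites(PC194CMRF1)
reverse_sites = find_sites(PC194CMRR1)
```

With these sequences, the forward primer carries the ApaI, AatII and PstI sites in that order and the reverse primer carries only the SphI site, consistent with the cloning strategy described above.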
Supporting information S1