Whole Genome Sequences of Three Treponema pallidum ssp. pertenue Strains: Yaws and Syphilis Treponemes Differ in Less than 0.2% of the Genome Sequence

Background The yaws treponemes, Treponema pallidum ssp. pertenue (TPE) strains, are closely related to syphilis causing strains of Treponema pallidum ssp. pallidum (TPA). Both yaws and syphilis are distinguished on the basis of epidemiological characteristics, clinical symptoms, and several genetic signatures of the corresponding causative agents. Methodology/Principal Findings To precisely define genetic differences between TPA and TPE, high-quality whole genome sequences of three TPE strains (Samoa D, CDC-2, Gauthier) were determined using next-generation sequencing techniques. TPE genome sequences were compared to four genomes of TPA strains (Nichols, DAL-1, SS14, Chicago). The genome structure was identical in all three TPE strains with similar length ranging between 1,139,330 bp and 1,139,744 bp. No major genome rearrangements were found when compared to the four TPA genomes. The whole genome nucleotide divergence (dA) between TPA and TPE subspecies was 4.7 and 4.8 times higher than the observed nucleotide diversity (π) among TPA and TPE strains, respectively, corresponding to 99.8% identity between TPA and TPE genomes. A set of 97 (9.9%) TPE genes encoded proteins containing two or more amino acid replacements or other major sequence changes. The TPE divergent genes were mostly from the group encoding potential virulence factors and genes encoding proteins with unknown function. Conclusions/Significance Hypothetical genes, with genetic differences, consistently found between TPE and TPA strains are candidates for syphilitic treponemes virulence factors. Seventeen TPE genes were predicted under positive selection, and eleven of them coded either for predicted exported proteins or membrane proteins suggesting their possible association with the cell surface. Sequence changes between TPE and TPA strains and changes specific to individual strains represent suitable targets for subspecies- and strain-specific molecular diagnostics.


Introduction
Yaws is a treponemal disease found in rural communities of tropical regions of Africa, Asia, Oceania and South America, that predominantly affects children under 15 years of age. The causative agent of yaws, Treponema pallidum ssp. pertenue (TPE), was discovered at the same time [1] as the causative agent of syphilis, Treponema pallidum ssp. pallidum (TPA, [2]). The estimated prevalence of yaws is about 2 million cases worldwide [3], however, up to 50 million cases were estimated before 1952 when WHO established 'The Global Yaws Control Program'. Between 1952-1964, more than 300 million people were treated with penicillin and the global prevalence of yaws declined significantly [4,5]. Yaws was almost eliminated, however, local epidemics appeared in several areas in the 1970s [6]. Now, yaws is reemerging in rural populations of Africa, Asia and South America. Since humans are the only known reservoir of yaws and because of the availability of cheap, single-dose penicillin treatment, eradication of this disease remains promising. In fact, renewed plans to eradicate yaws emerged recently [7].
TPE strains are closely related to syphilis causing TPA strains and the two subspecies cannot be distinguished by morphology, protein electrophoresis, bacterial physiology or host immune response [8,9]. The genetic relationship between pallidum and pertenue subspecies was determined in 1980 by measuring the degree of their DNA sequence homology. The data indicated that TPE strain Gauthier and TPA strain Nichols were identical within the limits of resolution of this technique (about 2% of the genome differences; [10]). This led to reclassification of both organisms into a single species [11]. Several genetic differences between both subspecies have been already published including differences in the TP1038 gene at position 122 [12,13], in the 16S rRNA gene (unpublished data, cited in [14]), in the 59-and 39-flanking regions of TP0171 [15], in the gpd gene (TP0257) at position 579 [16], in tp92 (TP0326) [17], and in tprI and tprC [18]. Newly performed analyses on strains belonging to pallidum and pertenue subspecies suggested that the genetic differences in the homologous chromosomal regions were less than 0.5% [19]. Such subtle differences need to be further characterized using high quality whole genome sequencing.
Yaws and syphilis are distinguished on the basis of epidemiological characteristics and clinical symptoms. Although both yaws and syphilis are multi-stage diseases, yaws, unlike syphilis, is mostly characterized by skin nodules, skin ulcerations, joint and soft tissue destruction and bone affections. However, in certain cases, yaws treponemes can also infect the central nervous system, cardiovascular system and fetus [20]. Yaws is an endemic disease of humid and warm rural areas that is transmitted through direct skin contact from an infected patient to a recipient (preferably among children) via a preexisting defect in the recipient's skin barrier. Syphilis is a sexually transmitted (and congenital) disease affecting adults and newborns worldwide. However, the transmission route of syphilis and yaws may reflect epidemiological differences rather than inherent differences between TPA and TPE strains [21]. Differences in the clinical manifestation and epidemiology of both diseases are likely to reflect primary genetic differences in the corresponding genomes.
In this communication, we compare three complete genome sequences of T. p. ssp. pertenue (Samoa D, CDC-2, Gauthier) to genomes of four T. p. ssp. pallidum strains (Nichols, DAL-1, SS14, Chicago). Genetic differences, which are consistently found between all studied strains of TPE and all studied strains of TPA, can differentiate the two pathogens at the molecular level and are candidates for important virulence factors of syphilitic treponemes. Additionally, the identified sequence changes represent suitable targets for specific molecular diagnostics.

Materials and Methods
Isolation of Treponema pallidum chromosomal DNA T. pallidum ssp. pallidum (TPA) strain Nichols and T. pallidum ssp. pertenue (TPE) strains CDC-2, Gauthier and Samoa D were grown in rabbit testes, extracted and purified from testicular tissue using Hypaque gradient centrifugation [22,23]. Chromosomal DNA was prepared as described previously [23].
DNA sequencing and assembly of the TPE Samoa D genome Whole genome DNA sequencing used a combined approach including Comparative Genome Sequencing (CGS, [24]), 454 pyrosequencing [25] and the Solexa approach [26]. Isolated genomic DNA from Samoa D strain was amplified using REPLI-g kit (QIAGEN, Valencia, CA, USA) and the amplified genomic DNA was used for CGS. CGS resulted in 904 identified single nucleotide changes (SNPs) when compared to the TPA Nichols reference genomic DNA. In addition, 102 genomic regions were recommended for dideoxy-terminator sequencing (DDT) due to increased sequence heterogeneity between tested and reference genomes. Parallel to CGS, unamplified Samoa D DNA was sequenced using the 454 technique on a GS20 apparatus (454 Life Sciences Corporation, Branford, CT, USA) to average depth coverage 276. The 454 method had an average read length of 100 bp. These reads were assembled, with a Newbler assembler, into 154 contigs covering 98.6% of the reference TPA Nichols genome (AE000520.1). The Solexa sequencing technology (Illumina, San Diego, CA, USA) was used to improve accuracy of the complete genomic sequence resulting in an additional 756 coverage with an average read length of 36 bp. These reads were assembled, with a Velvet assembler [27], into 230 contigs. All discrepancies in CGS, 454 and Solexa sequences (i.e. 575 sites covered with 361 individual PCR products) were resequenced using the DDT method.
DNA sequencing and assembly of the TPE CDC-2 genome A combination of 454 sequencing (366 coverage, 101 bp average read length, 98.6% coverage of the AE000520.1 genome), Solexa sequencing (746coverage, 36 bp average read length), and DDT technology was used for determination of the complete CDC-2 genome sequence using amplified genomic DNA. All nucleotide discrepancies between Samoa D and CDC-2 genome were confirmed using the DDT method.

DNA sequencing and assembly of the TPE Gauthier genome
Amplified genomic DNA of TPE Gauthier was sequenced using 454 (266 coverage, 101 bp average read length, 1508 contigs, 92.6% coverage of AE000520.1 genome), and Solexa methods (656 coverage, 36 bp average read length, 902 contigs). However, due to substantial contamination of the TPE genomic DNA with rabbit DNA (data not shown) and more than 1500 gaps in the assembled genome sequence, chromosomal Gauthier DNA was amplified in 134 overlapping PCR intervals (Treponema pallidum intervals -TPI, described in [19,28]) covering the entire treponemal genome using a GeneAmpH XL PCR kit (Applied Biosystems, Forster City, CA, USA). To avoid misassembly of sequentially related genes, equimolar PCR products were combined into four pools that were sequenced as different samples on a 454 platform and one sample

Author Summary
Spirochete Treponema pallidum ssp. pertenue (TPE) is the causative agent of yaws while strains of Treponema pallidum ssp. pallidum (TPA) cause syphilis. Both yaws and syphilis are distinguished on the basis of epidemiological characteristics and clinical symptoms. Neither treponeme can reproduce outside the host organism, which precludes the use of standard molecular biology techniques used to study cultivable pathogens. In this study, we determined high quality whole genome sequences of TPE strains and compared them to known genetic information for T. pallidum ssp. pallidum strains. The genome structure was identical in all three TPE strains and also between TPA and TPE strains. The TPE genome length ranged between 1,139,330 bp and 1,139,744 bp. The overall sequence identity between TPA and TPE genomes was 99.8%, indicating that the two pathogens are extremely closely related. A set of 34 TPE genes (3.5%) encoded proteins containing six or more amino acid replacements or other major sequence changes. These genes more often belonged to the group of genes with predicted virulence and unknown functions suggesting their involvement in infection differences between yaws and syphilis. on a Solexa platform. 454 and Solexa sequencing resulted in 506 coverage (average read length 233 bp, 78 contigs, 98.5% coverage of AE000520.1 genome) and in 2066coverage (35 bp average read length, 80 contigs assembled using Velvet; [27]), respectively. All gaps in the complete sequence were filled using the DDT method.

Identification of genetic heterogeneity within TPE genomes
The Solexa reads belonging to individual genomes were used for identification of genetic heterogeneity within TPE genomes. In brief, individual Solexa reads were aligned to the final genome sequence using the Burrows-Wheeler Aligner (BWA) [31]. For downstream analyses and variant callings, the Samtools package was applied [32].

Whole genome fingerprinting (WGF)
To verify final genome assemblies, whole genome fingerprints [19,28] were compared to the in silico restriction enzyme analysis of each sequenced genome. Three enzymes, BamH I, EcoR I and Hind III, were used for restriction analysis of all amplicons. The use of other enzymes was optional depending on the length of restriction fragments and the availability of restriction target sites. The average error rate of WGF for Treponema paraluiscuniculi strain Cuniculi A genome was calculated previously [33] and corresponded to 27.9 bp (1.6% of the average fragment length) with a variation range between 0 and 132 bp.

Gene identification and annotation
FgenesB (Softberry Inc., New York, USA), GeneMark [34] and Glimmer [35] were used independently for prediction of open reading frames (ORFs) in the Samoa D genome sequence. Visualization of gene prediction was performed using a Genboree system (www. genboree.org) and the CONAN database [36,37]. To predict genes coding for tRNA, rRNA and non-coding RNAs, tRNAscan [38], RNAmmer [39] and Rfam [40] were used. DNA comparisons were performed using BLASTN and BLASTX algorithms. Protein sequences were analyzed using BLASTP versus the nr database at NCBI [41]. When appropriate, other predictive tools were used as described previously [42]. For proteins with unpredicted functions, a gene size limit of 150 bp was applied. For the CDC-2 and Gauthier gene annotations, predicted gene coordinates from Samoa D genome were adapted and recalculated. When needed, genes were newly predicted in some regions. In most cases, the original locus tag values of annotated Nichols genes were preserved in TPE orthologs. Genes newly predicted in TPE genomes were named according to their preceding gene with a letter suffix (e.g. a).

Comparisons of whole genome sequences
Three TPE complete genome sequences were compared to the genomes of four TPA strains (Nichols, SS14, Chicago and DAL-1).
The Nichols strain genome (AE000520.1) was completely sequenced by Sanger sequencing method (DDT) and coding regions (ORFs) were identified with compositional analysis [23,43]. Comparative Genome Sequencing (CGS) approach and DDT method were used for sequencing of the SS14 strain (CP000805.1). Identification of ORFs and gene annotations were based on Nichols predicted gene coordinates that were adapted and recalculated [29]. Nichols and SS14 genome sequences were improved by 454 and Illumina resequencing (unpublished data). The Chicago genome (CP001752.1) sequencing was based on Solexa method, annotation was made by J. Craig Venter Institute Annotation Service [44]. A combination of 454 sequencing, Solexa method and Sanger sequencing was used for determination of the complete DAL-1 genome sequence (CP003115, unpublished results). Identification of ORFs and annotations were based on the predicted gene coordinates from the Nichols and Samoa D genomes.
Two approaches were used to determine genome differences between and within subspecies including a script based on Cross match software (unpublished) and whole genome alignments. Whole genome alignments were calculated using SeqMan, a Lasergene program package software (DNASTAR, Madison, WI, USA), with manual corrections. These methods were used for gene annotations and for calculation of nucleotide changes present in intergenic and coding regions. Both tprD and tprK genes were excluded from these calculations because of their high sequence diversity between TPA and TPE strains. Subsequently, nucleotide changes between studied subspecies and within them were analyzed using DnaSP software, version 5.10 [45]. Genetic relatedness was characterized by calculation of nucleotide diversity (p) within TPE and TPA strains. Alignment of TPE and TPA strains was used to calculate whole genome nucleotide divergence (d A ). Complete TPE and TPA nucleotide sequences (including tprD and tprK genes) were used for these calculations.

Gene classification
TPE genes were classified into eight groups according to their probable function and into four categories (Table 1) according to the number of amino acid changes in the corresponding proteins coded compared to TPA proteins. The categories included (1) genes encoding identical proteins, (2) proteins with single amino acid substitution, (3) proteins with two to five amino acid changes, and (4) proteins with more than 6 amino acid changes or multiple sequence changes (MSC). MSCs were defined as 15 or more dispersed amino acid substitutions and/or indels longer than 10 amino acids and/or truncated (results of frameshift mutations, nonsense and start codon changing mutations) and/or full length proteins (results of reverted frameshift mutations). The distribution of genes into categories and gene functional groups was statistically analyzed.

Statistical analyses
STATISTICA version 8.0 (StatSoft, Tulsa, OK, USA) was used for statistical calculations. In addition, an interactive calculation tool for chi-square tests of ''goodness of fit'' and independence was used [46]. The numbers of genes in four categories (classified according to number of amino acid changes in the corresponding proteins) of each functional gene group (Table 1) were compared to the gene numbers found in the general metabolism functional group. The p-value was corrected using the Bonferroni correction for multiple comparisons to p = 0.008 and p-values less than 0.008 were considered to be statistically significant.

Type of selection
For genes encoding proteins with two or more amino acid changes, the codon-based test for estimation of selection type was calculated using the Kumar model [47] and MEGA4 software [48]. Before prediction of selection type, the presence of possible recombination events was tested in whole genome alignments of TPE and TPA strains using the Recombination Detection Program version 3.44 (RDP3) [49]. Three methods implemented in the RDP3 program were used including the RDP method, the GENECONV [50] and the Maximum Chi-squared (MaxChi) method [51,52]. Non-default settings comprised a maximum p-value of 0.01 and a Bonferroni correction. For the RDP method, a window size of 10 nt was used. For the MaxChi method, the number of variable sites per window was set to 30. Recombinant regions predicted by all three methods were considered as significant and the affected genes were omitted from further analyses. Gene sequences containing frameshift mutations were also removed from alignment and analyses. Genes with predicted positive or purifying selection were subsequently tested for possible intragenomic recombination events using BLAST search. Expected threshold was set to 50, word size to 16 and 20, respectively. TPE genes containing sequences from other part of the genome causing differences between TPA and TPE sequences were considered as genes containing recombinant sequences and were removed.

Nucleotide sequence accession numbers
The genomes of TPE Samoa D, CDC-2, and Gauthier were deposited in the GenBank under the accession numbers CP002374, CP002375, CP002376, respectively.

Results
Whole genome sequencing, intrastrain variability in Treponema pallidum ssp. pertenue genomes, genomic parameters and genome annotation The genome of Treponema pallidum ssp. pertenue (TPE) Samoa D was sequenced using 3 independent whole genome sequencing methods including comparative genome sequencing approach (CGS), pyrosequencing and Solexa sequencing (see Material and Methods). In addition, the Sanger sequencing method was used for finishing the complete genome sequence. The genomes of T. pallidum ssp. pertenue CDC-2 and Gauthier were sequenced using 2 methods, pyrosequencing and Solexa technology. CGS method was already used previously for sequencing of the SS14 genome [29] and the sequence of the strain Chicago was obtained by Solexa sequencing [44].
Analysis of the Solexa data revealed 8, 15, and 2 genome positions showing intrastrain variability in the TPE CDC-2, Samoa D, and Gauthier genomes, respectively. The most variable gene included TPE_0897 (5 positions in the CDC-2, 13 in the Samoa D, and 2 in the Gauthier genome). In addition, the CDC-2 genome showed heterogeneity within the TPECDC2_0313 (2 positions) and TPECDC2_0622 (1 position) genes, and the Samoa D genome in the TPESAMD_0110 (1) and TPESAMD_0134 (1) genes.
TPE Samoa D genome was annotated de novo and this annotation served as a reference for annotation of CDC-2 and Gauthier genomes. The following names TPESAMD, TPEGAU and TPECDC2 were assigned as gene names for the loci of Samoa D, Gauthier and CDC-2 strain, respectively. Gene names for the loci of all TPE strains start with the prefix TPE. The summarized genomic features of the TPE strains Samoa D, CDC-2 and Gauthier are shown in Table 2. All three genomes had identical GC content (52.8%) and similar length ranging between 1,139,330 bp and 1,139,744 bp. The difference of 414 bp between the largest (CDC-2) and the smallest (Samoa D) genome represents only 0.036% of the CDC-2 genome. The genome structure was identical in all strains and no major genome rearrangements were found. In all genomes, 1125 genes were annotated including 54 noncoding RNA genes. The noncoding RNA genes comprised tRNA (45), rRNA (6) and ncRNA (3) genes. The average and median gene lengths were Table 1. Distribution of identified differences between TPA and TPE proteins encoded by different functional gene groups. a statistically significant results when compared to the genes encoding components of general metabolism. The p-value was corrected using Bonferroni correction for multiple comparisons to p = 0.008 and p-values lower than 0.008 were considered as statistically significant. p-values lower than 0.05 are also shown. b MSC (major sequence changes) were defined as 15 or more dispersed amino acid substitutions, indels longer than 10 amino acids, truncated and full length proteins (results of frameshift and reverted frameshift mutations, nonsense and start codon changing mutations With two exceptions, all genes predicted in the Samoa D genome were also predicted in the CDC-2 and Gauthier genomes. The first exception was TPESAMD_0924a (encoding a hypothetical protein), predicted only in Samoa D and not in the CDC-2 and Gauthier genomes (CDC-2 and Gauthier orthologs were shorter than the used gene length limit of 150 bp). The second exception included prediction of two genes in the CDC-2 and Gauthier genomes (TPEGAU_0126 and TPEGAU_0126a; TPECDC2_0126 and TPECDC2_0126a), while only one gene was predicted in the Samoa D genome (TPESAMD_0126). However, TPESAMD_0126a included both 0126 and 0126a sequences ( Figure 1, Table 3).
Compared to the previously published TPA Nichols genome sequence (AE000520, [23]), the orthologs of TPE genes TP0206a, TP0451a, TP0949a, and TP0950a were later annotated in the Nichols genome (GenBank AE000520.1). These genes are therefore not listed in Table S1, which lists only newly annotated genes. Moreover, the updated Nichols genome sequence (GenBank AE000520.1) does not contain TP0950 and therefore, this gene is not shown in Table S2 (TPA Nichols genes not predicted in TPE genomes).
Compared to the Nichols genome (AE000520.1), we found 95 newly predicted genes (Table S1) in pertenue genomes. They coded for hypothetical or hypothetical membrane proteins (HP or HMP, 89 genes), for conserved hypothetical proteins (CHP, 3 genes), for treponemal conserved hypothetical proteins (TCHP, 2 genes) and for a transport protein (1 gene, predicted preprotein translocase). The newly annotated genes had considerably lower median length (228 bp) compared to all annotated genes (831 bp). TPA sequences orthologous to 5 of the newly predicted TPE genes (TPE_0126b, TPE_0134a, TPE_0349a, TPE_0548a, TPE_0911a) contained frameshifts or start codon mutations and were considered pseudogenes in TPA genomes.
A set of 33 genes (with a median length of 126 bp) predicted in the Nichols chromosome [23] were not annotated in TPE genomes (Table S2) because of the 150 bp gene limit (see Materials and Methods) or other genes annotated at those loci. These genes were annotated in TPA Nichols, but only as hypothetical proteins, having no similarities to genes in other genomes. However, all the corresponding orthologous sequences were present in the TPE genomes.

Whole genome fingerprinting
The recently published data on whole genome fingerprinting of TPE genomes [19] were used to assess the overall assembly of the genome sequence and also to compare in silico identified restriction target sites (RTS) with experimental restriction digest patterns of individual TPI regions covering the entire TPE genomes. Altogether, 1,773 RTSs representing more than 10.6 kb were No. of annotated pseudogenes (no. of all pseudogenes) 6 (13) 6 (13) 6 (13) No. of tRNA loci 45 45 45 No. of rRNA loci 6 (2 operons) 6 (2 operons) 6 (2 operons) experimentally tested [19]. According to results obtained from WGF of the CDC-2 strain, one RTS (Sph I site, in the interval TPI21A) was missing in the complete sequence of the CDC-2 genome. Extensive sequencing of this region revealed no Sph I site suggesting that non-specific cleavage of the CDC-2-amplified DNA had occurred. Shorter incubations with Sph I (1 hour) did not result in such cleavage (data not shown). No discrepancies between in silico and experimental RTS analyses of Samoa D and Gauthier genomes were found. The estimated sequencing error rate for TPE genomes was therefore 10 24 or lower [33].

Gene fusions
Compared to the Nichols genome (AE000520.1), 50 TPE orthologs were fused into 24 genes (see Table 3). With the exception of 2 gene fusions (TPE_0314, fusion of TP0314 and TP0315; TPE_0859, fusion of TP0859 and TP0860), all fusions found in all TPE genomes were also present in the resequenced Nichols genome (data not shown). Interestingly, all other investigated TPA strains (SS14; P. Pospíšilová, personal communication, DAL-1; data not shown; Chicago, [44]) contained the same number of fused genes as the resequenced Nichols genome including separated TP0314, TP0315, TP0859, and TP0860 genes.

Pseudogenes
Altogether, 6 annotated and 7 non-annotated pseudogenes were found in the TPE genomes. Three pseudogenes (TPE_0146, TPE_0520, and TPE_0812) were annotated in TPE genomes as authentic frameshifts similar to the annotation of the Nichols genome (AE000520.1). Additional 3 pseudogenes were annotated only in the TPE genomes with gene counterparts in the Nichols genome (TPE_0129, TPE_0370, TPE_0671). TPE genome sequences corresponding to Nichols orthologs (TP0132, TP0135, TP0180, TP0266, TP0318, TP0532, and TP1030) were not annotated in TPE genomes as pseudogenes (Table S2) because of overlap with other genes (TPE_0532) or because gene length limits in the de novo annotations. However, all these orthologs were considered as pseudogenes in the TPE genomes.

Genetic differences between syphilis and yaws treponemes
With the exception of non-annotated pseudogenes, the newly predicted TPE genes (90 genes; Table S1) and genes not predicted in TPE (33 genes; Table S2) but predicted in the Nichols genome represented differences in annotation algorithms and not real sequence changes since the corresponding orthologous sequences were present in both TPA and TPE genomes.
In addition to pseudogenes (13), the genetic differences were also characterized in 970 similarly annotated protein-coding genes in both TPE strains (Samoa D, CDC-2, and Gauthier) and TPA strains (Nichols, DAL-1, SS14, and Chicago). In a majority of these genes (in 794 out of 970, 81.9%), there were no differences in lengths of annotated orthologous genes. In 155 genes (16.0%), differences in the gene length between TPA and TPE genes resulted from different algorithms used for gene prediction or from sequencing errors in the published Nichols genome sequence (AE000520.1) and did not represent real sequence differences between TPE and TPA genomes. In addition, length differences in 14 genes were caused by strain specific sequence variants, specific for one of the sequenced TPE genomes. Therefore, these differences were also not considered as capable of differentiating between TPE and TPA genomes.
A set of 692 out of 983 TPE (970 similarly annotated proteincoding genes in both subspecies and the 13 pseudogenes) protein encoding genes (70.4%) were found to encode either identical proteins or identical proteins with strain specific changes ( Table 1,  Table S3), 194 (19.7%) genes encoded proteins with 1 amino acid substitution consistently present between all tested TPE and all tested TPA genomes, 63 (6.4%) genes encoded proteins with 2 to 5 amino acid substitutions and 34 (3.5%) genes encoded proteins with 6 or more amino acid replacements, coded for full length proteins (as a result of reverted frameshift mutations), or truncated Genome Sequences of T. p. ssp. pertenue Strains www.plosntds.org proteins (encoded by pseudogenes). TPE genes in these categories were further subclassified according to their functional groups including general metabolism, cell processes and structure, DNA metabolism, regulation and gene expression, transport, virulence and unknown function (Table 1). A statistically significant decrease was found in the category of genes encoding identical proteins in the virulence group compared to corresponding genes from the general metabolism group. A statistically significant increase was found within the virulence and unknown function groups in the category of genes encoding proteins with 6 or more amino acid replacements and/or MSC (major sequence changes, see Table 1 and Materials and Methods section for details).
TPE genes encoding proteins with no or one amino acid replacement compared to TPA proteins A set of 451 (45.9%) TPE genes were found to be identical in all studied TPE and TPA strains. Compared to TPA genes, 602 (61.2%) TPE genes encoded identical proteins, and 692 (70.4%) TPE genes encoded either identical proteins or identical proteins with only strain specific changes (Table S3). Altogether, 886 (90.1%) TPE genes encoded proteins with a maximum of 1 amino acid difference or strain specific changes. 283 out of these 886 genes encoded proteins of unknown function and the remaining 605 genes encoded proteins involved in the gene regulation, transcription and translation (169 genes), general metabolism (151 genes), transport (102 genes), cell processes and structure (119 genes), DNA replication, repair and recombination (46 genes), or virulence (18 genes) (see Table 1).

Type of selection in divergent genes
TPE genes encoding proteins with 2 or more amino acid replacements and/or other major sequence changes were tested for prediction of selection type. Possible recombination events were identified in TPE genes orthologous to TP0136, TP0316, TP0433, TP0618, TP0619, TP0898, TP1031 and together with previously predicted recombined genes (TP0117, TP0131, TP0620, TP0621, [56]) were omitted from further analyses. All TPE orthologous genes with frameshift (TP0103, TP0132, TP0135, TP0180,  TP0314, TP0318, TP1030), reverted frameshift mutations (TP0009) and long deletions (TP0266) were also excluded from The corresponding protein was identified as an antigen [54]. f The corresponding protein was identified as a lipoprotein [55]. g The selection test was calculated using the Kumar model [47] using MEGA4 [48]  analyses. In addition, the tprK gene, showing intrastrain variability [57], was also omitted.
In 17 and 1 of 78 tested genes, positive and purifying selection between TPA and TPE orthologous genes were identified, respectively. Nine genes with prediction of positive selection type encoded ( Table 4, Table S4) proteins with predicted functions including antigens (tp92), proteins involved in cell processes and structure (mcp, tig, fliK), transport (TPE_0143, nptA), general metabolism (asnA, thiI) and translation (argS). With exceptions of asnA, thiI and argS, all other genes encode proteins predicted as membrane or periplasmic proteins. Eight genes ( Table 5, Table S5) of unknown function were predicted under positive selection and in four of them, a signal sequence was predicted. In addition, TPE_0515 was predicted as an outer membrane protein. Moreover, 6 out these 8 hypothetical genes were highly transcribed during experimental Nichols infections in rabbits [58].
TPA strains differ completely in intergenic regions with 43 changes (16 indels, 27 substitutions) and in coding regions with 379 nucleotide changes causing 96 synonymous, 33 conservative, 249 non-conservative amino acid substitution, and one protein elongation (as a consequence of a single read-through stop codon mutation).
Nucleotide diversity calculated among and between subspecies strains is summarized in Table 7. The computed average number of nucleotide differences between both subspecies was 2111.7 nucleotides, which represents 0.185% of the smallest Samoa D genome. The whole genome nucleotide divergence (d A ) between subspecies is 4.7 and 4.8 times higher when compared to observed nucleotide diversity (p) within TPA and TPE strains, respectively.

Discussion
The complete genome sequences of three yaws-causing T. pallidum ssp. pertenue strains, i.e. Samoa D, CDC-2, and Gauthier, were determined. Each genome was sequenced using at least two independent techniques and the discrepancies were resequenced with Sanger sequencing to obtain high quality genome sequences. Genetic heterogeneities discovered during analysis of individual Solexa reads were, with the exception of the TPE_0110, found in the genes belonging to paralogous families. Therefore, most of the observed intrastrain heterogeneities represent rather misaligned Solexa reads than real intrastrain genetic variability. The whole genome fingerprinting data, published earlier [19], revealed a high quality sequenced genomes with assessed sequencing error rates better than 10 24 . Deciphering the whole genome sequence of these uncultivable pathogens is an essential step before additional studies can be performed [58][59][60][61][62].
Striking sequence similarities (higher than 99.8%) to TPA genomes were found comprising identical genome structure and similar genome sizes. The Samoa D strain was isolated in Western Samoa in 1953 [63], the CDC-2 in Akorabo, Ghana in 1980 [64], and Gauthier strain in Brazzaville, Congo in 1960 [65]. Although the sequenced TPE strains were isolated at different time points and different geographical regions, their genome sizes were very similar, representing only a 414 bp difference between the largest (CDC-2) and the smallest (Samoa D) genome (0.036% of the CDC-2 genome). In addition, G+C content, number of predicted genes and gene order was identical in all genomes reflecting low nucleotide diversity of individual genomes among TPE subspecies strains (0.00032). The whole genome nucleotide divergence between TPA and TPE subspecies was 4.7 and 4.8 times higher than the observed nucleotide diversity among each subspecies indicating close evolutionary relatedness between yaws and  [66]. The classification of syphilis and yaws treponemes into TPA and TPE subspecies, respectively, is therefore based on very subtle genetic differences. Interestingly, the yaws treponemes were found to be closely related to the unclassified treponemal simian isolate Fribourg-Blanc [19,67]. The Fribourg-Blanc treponemes were isolated in 1962 [68] from a baboon (Papio cynocephalus) in Guinea, Africa. Indeed, experimental inoculation of humans with the Fribourg-Blanc strain resulted in symptoms similar to yaws [69]. Although syphilis and yaws share a very similar multistage course of symptoms including a late-stage disease phase and the same histopathological changes, there are differences in clinical manifestations and epidemiology between yaws and syphilis. These differences comprise a different route of transmission, the extent of tissues and organs that can be affected during infection and transplacental infection of the fetus in syphilis [70]. In general, yaws treponemes can be considered less virulent compared to syphilis treponemes. Based on these differences, syphilis and yaws treponemes were originally considered as separate species; however, since 1984 they have been classified as subspecies [11] based on DNA hybridizations experiments [10]. Since that time, several publications showing genetic differences between TPA and TPE have been published, including differences in the TP1038 gene at position 122 [12,13], in the 16S rRNA gene (unpublished data, cited in [14]), in the 59-and 39-flanking regions of TP0171 [15], in the gpd gene (TP0257) at position 579 [16], in tp92 [17], and in tprI and tprC [18]. In this study, about 1192 single nucleotide substitutions and 12 chromosomal regions with more extensive changes (TPE_0117, TPE_0131, TPE_0132, TPE_0136,  TPE_0316, TPE_0317, TPE_0433, TPE_0548, TPE_0620,  TPE_0621, TPE_0897, TPE_1031) were found to be present between all investigated TPE and TPA strains (Figure 2). This set of genetic differences represents the pool of potentially important changes differentiating syphilis and yaws treponemes. Sequencing of additional TPE and TPA genomes will eventually lead to further downsizing of this list of differences. Assuming that changes that lead to the differing pathogenic potential between yaws and syphilis treponemes represent changes in protein sequence and/or in intergenic regulatory regions (possibly affecting gene expression patterns), then the list can be shortened by 369 synonymous mutations leaving 823 intergenic and non-synonymous mutations.
Since 70.4% of genes encode either identical proteins or identical proteins with strain specific changes, potential ''yaws specific'' mutations should be found among 34 (3.5%) genes encoding proteins with 6 or more amino acid substitutions and/or major sequence changes (including pseudogenes), among 63 (6.4%) genes encoding proteins with 2-5 amino acid changes, or among 194 (19.7%) genes encoding proteins with 1 amino acid substitution. Since genes encoding putative virulence factors were enriched in the category of genes encoding proteins with 6 or more amino acid replacements and/or MSC compared to other gene categories (i.e. genes encoding identical proteins, genes encoding proteins with one amino acid replacement, and genes encoding proteins with 2-5 amino acid changes), it is likely that the ''yaws specific'' changes are to be found in the group of putative virulence factors. Similarly, an increased (though not statistically significant) percentage of genes encoding 6 or more amino acid replacements and/or MSC was found in the group of genes of unknown function. In fact, some of the hypothetical proteins may encode for yet unknown virulence factors. This prediction is supported by the fact that the genes encoding hypothetical proteins with 6 or more amino acid replacements and/or MSC (20 genes, Table 5) are highly transcribed (average transcription rate 2.3) in the TPA Nichols strain during experimental rabbit infection [58], while the average transcription rate calculated from values for all genes was 1.0. In addition to this finding, several of these genes were found to contain MSC in the genome of non-human pathogen T. paraluiscuniculi Cuniculi A [33], further suggesting their potential role in human treponemal infections. Moreover, genes responsible for ''yaws specific'' changes could be found in the group of 17 genes predicted under positive selection between TPA and TPE strains. The positively selected genes were often found to encode proteins with predicted signal sequences or membrane proteins including predicted lipoproteins and outer membrane proteins suggesting that these proteins may be responsible for differences in pathogenesis between syphilis and yaws. Genes with predicted positive selection encoding components involved in general metabolism, transport, cell processes and cell structure suggest that positive selection of these genes may be related to adaptive processes, such as climate adaptation. As shown by previous studies [15,56,67], T. pallidum ssp. endemicum is very closely related to pertenue subspecies. Therefore, many ''yaws specific'' mutations are likely to be in the future identified also in the ssp. endemicum strains. For instance, out of 1192 nucleotide substitutions differentiating all tested TPE strains from all TPA strains, 551 (46%) were found also in the T. paraluiscuniculi genome.
The only pseudogene with a known function (TPE_0671), identified in the TPE genomes, coded for ethanolamine phosphotransferase. The phosphoethanolamine transferase of Campylobacter jejuni is involved in modification of the lipooligosaccharide lipid anchor lipid A and also in the modification of the flagellar rod protein FlgG [71]. Although the role of this enzyme is not known in lipopolysaccharide-free treponemes, the flgG genes (TP0960, TP0961) are present and conserved in the TPA and TPE genomes. On the other hand, a low expression rate of TP0671, in TPA Nichols [58], may suggest a nonessential role for this gene.
In addition to the ethanolamine phosphotransferase pseudogene (TPE_0671), 13 additional genes with predicted function showed 6 or more amino acid changes and/or MSC including 8 tpr genes. Although the function of Tpr proteins were not directly demonstrated, the Tpr proteins are considered important factors in pathogenesis and/or immune evasion [56,72]. Tpr proteins were shown to exhibit heterogeneity both among and between the T. pallidum subspecies and strains and were shown to induce an antibody response during treponemal infection [73][74][75]. Moreover, a gene conversion-driven model for antigenic variations of TprK during experimental infection has been proposed [76]. This work provides further indirect evidence of possible roles of tpr genes in treponemal virulence. Interestingly, eight tpr genes including tprB,C,D,E,F,I,J,L out of 12 tpr genes were predicted to be genes encoding rare outer membrane proteins (OMPs) in T. pallidum [77]. Since six out of these 8 OMPs were found to contain changes in TPE strains, the Tpr rare OMPs may contribute to differences in pathogenesis between syphilis and yaws. Besides the elongated RecQ protein (TPE_0103) and sequentially divergent Mcp (methylaccepting chemotaxis protein, TPE_0488), three treponemal antigens were found to display substantial sequence diversity: the fibronectin-binding protein (TPE_0136, [78]), the Tp92 -outer membrane protein (TPE_0326) and the Arp protein (TPE_0433).
TP0136 is an outer membrane protein involved in attachment to host extracellular matrix proteins. Although immunization with recombinant TP0136 did not prevent TPA infection, delayed ulceration in experimental animals has been demonstrated [78].
The Tp92 is an outer membrane antigen that partially protects rabbits from subsequent TPA infection [17]. Tests with recombinant Tp92 protein exhibited high sensitivity and specificity values and Tp92 is considered a promising diagnostic antigen in syphilis serodiagnostics [79]. Moreover, tests of Tp92 homologs, that are common among the oral Treponema species associated with periodontitis, have shown that these proteins have the potential to bind to host cells and stimulate host factors that contribute to inflammation and osteoclastogenesis [80]. The arp gene [81] contains a variable number of 60 bp-repetitive sequences and corresponds to the TP0433 and TP0434 gene loci of the published Nichols genome sequence [23]. The Arp protein has been shown to be immunogenic [82] and the variable number of its tandem repeats suggests important biological function [83,84]. Based on amino acid sequence variations, Arp repeat motifs were classified into 4 types in T. pallidum (I, II, III, II/III) [67,82] and the variability of its repeat types was correlated with a sexual transmission strategy [67] with invariant repeat type (II) in TPE strains.
Current serological diagnostic methods do not allow discrimination between yaws and syphilis infections. Although the differences in clinical manifestation and epidemiology between yaws and syphilis allow correct diagnosis in most cases, sometimes it is quite difficult or impossible to distinguish between yaws and congenital syphilis in children [85] or in pregnant women [86]. This can lead to improper diagnosis and treatment with important health consequences [86]. A complete compendium of TPEspecific differences is presented in this work and should allow selection of the most suitable chromosomal regions for differential identification of TPE infections. The correct diagnostics of yaws treponemes is also one of important steps in monitoring and eradication of yaws. Moreover, a list of specific changes for each of the investigated TPE genomes (Samoa D, CDC-2, and Gauthier) was also generated. Although these changes cannot account for differences in pathogenicity between TPA and TPE, they represent suitable targets for specific molecular diagnostics of individual strains.
Taken together, the yaws treponemes are very closely related to syphilis treponemes although they are less virulent. TPE genomes differ minimally from TPA genomes with approximately 5-times higher nucleotide divergence than the nucleotide diversity observed among TPA and TPE subspecies strains, respectively. The decreased degree of TPE virulence is likely the result of accumulation of genetic changes in tpr genes, genes encoding prominent antigens, and some other genes. Genes coding for hypothetical proteins with consistently identified nucleotide differences between TPE and TPA strains are candidates for virulence factors of syphilitic treponemes.

(DOC)
Table S2 T. p. ssp. pallidum Nichols genes not annotated in T. p. ssp. pertenue genomes. Genes predicted in the Nichols chromosome were not annotated in TPE genomes because of the 150 bp gene limit or other genes annotated at those loci. However, all the corresponding orthologous sequences were present in the TPE genomes. (DOC)