Whole genome sequences of Treponema pallidum subsp. endemicum isolated from Cuban patients: The non-clonal character of isolates suggests a persistent human infection rather than a single outbreak

Bejel (endemic syphilis) is a neglected non-venereal disease caused by Treponema pallidum subsp. endemicum (TEN). Although it is mostly present in hot, dry climates, a few cases have been found outside of these areas. The aim of this work was the sequencing and analysis of TEN isolates obtained from “syphilis patients” in Cuba, which is not considered an endemic area for bejel. Genomes were obtained by pool segment genome sequencing or direct sequencing methods, and the bioinformatics analysis was performed according to an established pipeline. We obtained four genomes with 100%, 81.7%, 52.6%, and 21.1% breadth of coverage, respectively. The sequenced genomes revealed a non-clonal character, with nucleotide variability ranging between 0.2–10.3 nucleotide substitutions per 100 kbp among the TEN isolates. Nucleotide changes affected 27 genes, and the analysis of the completely sequenced genome also showed a recombination event between tprC and tprI, in TP0488 as well as in the intergenic region between TP0127–TP0129. Despite limitations in the quality of samples affecting breadth of sequencing coverage, the determined non-clonal character of the isolates suggests a persistent infection in the Cuban population rather than a single outbreak caused by imported case.

Introduction Treponema pallidum ssp. endemicum (TEN) is the causative agent of endemic syphilis (bejel), a neglected non-venereal disease that is mostly present in hot, dry areas of the world. TEN treponemes are highly related (99.7% identity at the genome level) to the Treponema pallidum ssp. pallidum (TPA), the causative agent of syphilis [1].
Acute bejel infections are mostly found among children between two and 15 years. Like syphilis, bejel can be divided into disease stages. In the primary stage, a small, painless ulcer is usually found in the oral cavity or nasopharynx [2] and usually remains undetected. In the secondary stage, numerous lesions appear in several body areas. In the last stage, gummatous lesions or bone alterations can appear. In several documented cases, bejel treponemes also infected the nipples of nursing women or genital areas [3]. Transmission of this disease occurs typically through direct mucosal and skin contact or is transferred by eating utensils or drinking vessels [2].
While bejel is a disease typically associated with dry arid areas such as the Sahel and the Middle East [2], it has also been found in Canada [4], France [5], Japan [6,7], and Cuba [8]. Cases in Canada and France were explained by bejel being imported from endemic areas, Senegal and Pakistan, respectively [4,5]. On the other hand, the bejel cases in Japan and Cuba were originally identified in patients suspected of having syphilis and with no evidence of disease import and having a sexual route of transmission despite being considered as non-venereal disease [6][7][8].
In bacteriology, the strict definition of a clone tends to be loosened slightly, and clones are defined pragmatically as isolates that are indistinguishable or highly similar, using a particular molecular typing procedure [9]. Among TPA, where molecular typing is widely used, certain predominant genotypes are observed to infect the human population, i.e., the allelic profile 1.3.1. according to the MLST system [10] and 14d/f using enhanced CDC molecular typing [11]. In many cases, individual allelic profiles differs in only one SNV. In this paper, based on the previously determined treponemal genome diversities and treponemal mutation rates, we defined the non-clonal character as a difference between two isolates in more than two nucleotides, which is consistent with at least 10 years period of separate evolution of TEN isolates [10,12].
In the case of TEN, there are very few genomic analyses, and the two available whole genome sequences, differing in 37 single nucleotide variants, come from the reference strains Bosnia A [13] (CP007548) and Iraq B [14] (CP032303). In addition, nine recombinant loci have been detected in TEN isolates during typing studies [15].
In this work, we aimed to determine the whole genome sequences of four TEN isolates from Cuba in order to better assess the clonal character of isolates reported previously using MLST [8,16]. Moreover, the third TEN complete genome is reported.

Ethics statement
The study protocol was approved by the Research Ethics Committee of Institute of Tropical Medicine "Pedro Kourí" (CEI IPK 44-18), and it was conducted in compliance with the Declaration of Helsinki. All participants provided written consent to participate in the study.

Study design and clinical samples
This observational descriptive study includes four clinical samples from bejel patients collected in years 2014, 2015 and 2017. Patients attended to the Instituto de Medicina Tropical "Pedro Kourí," Havana, Cuba and preliminary characteristics were previously published [8,16]. Descriptions of the analyzed samples are shown in Table 1. Patients resided in Havana city and did not refer sexual contact with foreigners in the last six months. DNA was isolated using QIAmp DNA mini kits (Qiagen, Hilden, Germany) according to the manufacturer´s instructions. Following DNA isolation, whole genome amplification was performed using REPLI-g Single Cell kits (Qiagen, Hilden, Germany) [15]. The number of TEN copies in samples was determined by real-time PCR using primers targeting polA [17]. Samples for sequencing were selected according to the (1) total volume of available sample (at least 7.5 μl), and (2) percent positivity of PCR amplification of 14 different T. pallidum intervals (at least 5 out of 14 positive) [18].

Pooled segment genome sequencing (PSGS)
Genomes of samples C178 and C279 were chosen for the PSGS method [13,18,19], reflecting most successful ratios in the amplification of 14 TPIs (Treponema pallidum intervals) of pool 1 (11 of 14 for C279 and six for C178). All primers are listed in S1 Table [13]. PCR products were purified using QIAquick PCR Purification Kits (QIAGEN, Hilden, Germany) and were divided into four pools to separate paralogous genes; this was to avoid later misassembly of these genes. TPI 11A containing tprC from pool 1 and 25B containing tprE from pool 2 was added to pool 4. For sample C178, all PCR products of TPI were mixed in one pool since there were no paralogous regions amplified. PCR products of individual pools were mixed in equimolar amounts. During construction of the DNA library using a Nextera XT DNA Sample Preparation Guide kit, these four pools were labeled with multiplex identifier (MID) adapters. Prepared pools were sequenced using NGS sequencing on an Illumina platform (<150 nt paired-end reads). The numbers of repetitions in the arp and TP0470 genes in sample C279 and the sequence of TP0488 in sample C178 were determined using Sanger sequencing.

Direct sequencing
Samples C75 and C77 were directly sequenced as described previously with <150 nt pairedend reads [18]. Sample C77 was sequenced with and without DNA enrichment, and data from both approaches were combined in the analysis. For DpnI enrichment, 10-40 μl of the clinical sample was mixed with DpnI-coated beads to a final volume of 50 μl in 1.7 mL Eppendorf tubes as previously described [20].
Final sequences were determined by at least two good-quality aligned reads matching the reference genome. To identify SNVs and other changes, at least three good-quality aligned reads were required. The length of homopolymeric region that was found in more than 70% of reads was used in the consensus sequence. When the length of homopolymeric region have predominant reads less frequent than 70% of reads, it was Sanger sequenced.

Annotation of complete genomes
For gene annotation, Geneious software (2021.1.1.) was used. The tprK gene showed intrastrain variability in all samples, and the corresponding nucleotide positions were denoted as "N." Genes from the TEN C279 strain were tagged with the TENDC279_ prefix. In C279, the original locus tag numbering corresponds to the tag numbering of orthologous genes annotated in the TEN Bosnia A genome. In the draft genome of TEN C77 genes were annotated based on TEN Bosnia A and tagged TENDC77. Genome sequences of isolates C279 and C77 were deposited in the NCBI database under the following accession numbers (raw data were deposited under BioProject numbers): CP078090 (PRJNA742865), and CP081507 (PRJNA743168), respectively. Genome sequences of isolates C75 (PRJNA743169) and C178 (PRJNA743172) are available in the Supplementary material of this article (S1 and S2 Text).

Results
The overall results of the whole genome sequencing are shown in Table 2. Comparison of individual genome breadth of coverage is in Fig 1.

Sequence diversity in the Cuban TEN genomes
Comparison of whole genome sequence of TEN C279 with the partially determined genome sequences of C77, C75, and C178 showed sequence diversity among the samples. Differences among genomes and pairwise nucleotide diversity values are summarized in Table 3 and Fig  2A. The phylogenetic relationship between C279 and the previously reported TEN isolates is presented in Fig 2B. For construction of the tree, regions TP0136 (158 208-159 279), TP0548 (591,226-592,285) and TP0856 (932,947-933,182) were used. C75, C77, and C178 were omitted due to unavailable sequences of used loci. For the same reason, Japanese isolate Osaka-2014 was not used. As a root, TPE strain Samoa D (CP002374) was used.
Compared to the complete genome of TEN Bosnia A, TEN Cuban genomes showed 3.9-39.4 nucleotide substitutions per 100 kb of the genome, varying by one order of magnitude.

New recombination events in the C279 genome
In addition to the inter-strain recombination events at the TP0488 and TP0548 loci described earlier [5]; the C279 genome contains (compared to Bosnia A) a total of 332 changes in the tprC gene (TP0117), which is, in part, sequentially identical to the tprI gene of Bosnia A (Fig  3). The tprC of C279 is thus a result of an intrastrain recombination event where the recombinant gene contains about half of tprI copied to tprC. Between TP0127b and the pseudogene TP0128, the TEN genome C279 showed a 72 bp deletion followed by the duplication of a 53 bp-long sequence from TP0129. Similar to the tprCI chimera, the 53 bp-long sequence appears to be copied from TP0129 and inserted at the front of the TP0128 pseudogene. In other treponemal genomes, the region comprising TP0127-TP0129 was shown to be highly variable, including TPA Nichols, TEN Bosnia A, and TPeC strain Cuniculi A (Fig 4).

Discussion
While bejel typically occurs in dry arid areas such as the Sahel and the Middle East-Cuba, with its hot, humid weather, is not a typical region for bejel to occur [2]. In addition, Cuban bejel cases appear to be sexually transmitted, mostly among homosexual men, which is also not typical for bejel [2]. While in several countries, including Canada and France, bejel cases were found to have been imported, no such indication exists for bejel in Cuba [8,16]. This atypical occurence and overlap of treponematoses are highlighting importance of molecular typing.
Whereas a previous paper on molecular typing of syphilis treponemes (TPA) revealed the presence of TEN, all of the nine independent isolates were found to be genetically indistinguishable [8,16]. A similar situation was recently published for Japanese bejel cases where it was found that all isolates were identical when two chromosomal loci were sequenced [6]. Unlike the Japanese isolates [6], Cuban genomes (C279, C178) did not contain mutations (A2058G) leading to macrolide resistance.
Since 2013, there have been numerous reports describing traces of genetic recombination events among pathogenic treponemes at more than a dozen genetic loci [34,35]. While there are examples of both intra-and interstrain recombination events [18], in the genomes of the Cuban strains, we identified two possible intrastrain recombination events. The first intrastrain recombination event between tprG and tprJ has also been described in TPA, where a sequence from tprJ was copied to the tprG gene [18]. Here, we have described a new gene conversion event where the sequence from tprI was copied to the tprC locus. Despite the donor tprI sequence is present also in other genomes, it is more likely that the donor sequence of tprI originated from the same genome. Previous studies also revealed changes at the TP0126-TP0127 loci in the TPA Nichols and TPE strains [35,36], showing that this region differs in gene prediction and region size. Moreover, the region comprising TP0129-TP0130 was previously described as one of the major donor sites for the variable regions of the tprK gene [37]. The emergence of the 53 bp-long sequences copied from the TP0129 by intrastrain gene conversion is therefore not surprising in this region.
The interstrain recombination events in the TP0488 and TP0548 were described earlier [5]. From a previous study, we know that in most of the samples the TP0488 gene sequence was the same as in the 11q/j isolate and therefore likely represented a recombinant allele [5]. Another similarity between C279 and isolate 11q/j was found in gene TP0865, which contains a 23 bp insertion in the same position as the 11q/j harbor 22 bp-long insertions. The size of insertions differs because of the different numbers of nucleotides in the homopolymeric tract. In both cases (i.e., in C249 and in 11q/j), the gene is considered to be non-functional, containing a frameshift mutation. In the C279 genome, the recombinant allele also seems to be present at the TP0548 locus, similar to Bosnia A and Iraq B versions but different from the TEN 11q/j strain [5].
Whole-genome sequencing revealed a surprising amount of genetic diversity within the Cuban isolates ranging between 0.2-10.3 differences per 100 kb, which corresponds to an estimated 22.8 to 117.4 nucleotide differences between individual strains on a genome-wide level. However, these changes are not visible in phylogenetic trees due to the differences in sequenced genome segments and omission of undetermined positions (N nucleotides) from the phylogenetic tree. This extent of genetic diversity is comparable to differences between TEN Bosnia and Iraq B with an average value of nucleotide diversity of 3.1, with 37 single- TEN Bosnia A was used as the reference strain, and the TPE Samoa D strain was identical to TEN Bosnia A. The TEN genome C279 showed a 72 bp deletion and a 53 bp duplication compared to TP0129 of TEN Bosnia A. TPA Nichols strain in IGR showed a 6 bp deletion and a 30 bp insertion compared to Bosnia A. Most of the sequence of Cuniculi A showed deletions, except for 225 bp, which was sequentially unrelated to TP0127-TP0128 in the other strains.
https://doi.org/10.1371/journal.pntd.0009900.g004 nucleotide differences, four indels, two differences in the number of tandem repetitions, and 18 differences in the length of homopolymeric regions found in the Iraq B genome [14] compared to Bosnia A [13]. While TEN strain Iraq B was isolated in Iraq (the Middle East, southwestern Asia) in 1951, the strain Bosnia A was isolated in 1950 in Bosnia, southern Europe. A similar extent of genetic diversity was also found between TPE strains at different times and different geographical regions from which whole genome sequences are available (Samoa D, Gauthier, Kampung Dalan 363, Sei Geringging K403, CDC-1, CDC-2, CDC 2575, and Ghana-051) [12,36,38] ranging between 0.0-20.5 differences per 100 kb. This fact further supports the non-clonal character of Cuban TEN strains and is consistent with the long-term evolution of each of the Cuban TEN isolates. For comparison, TPA SS14-like strains obtained by direct sequencing [18] have nucleotide diversity in a range of 0.2-4.6.
Uncultivable pathogenic treponemes including TPA, TPE, and TEN are monomorphic bacteria [1,39], which have extremely high sequence similarity; therefore, it is likely that these related treponemes show similar mutation rates [39]. While there are no studies on mutation rates in TEN, the mutation rate in TPE and TPA have been estimated as 1.21 x 10 −7 and 0.82 x 10 −7 per nucleotide site per year or lower, respectively [10,12]. Assuming that the mutation rate is similar in TEN as in TPA and TPE, the estimated 22.8 to 117.4 nucleotide differences between individual TEN strains detected in Cuba suggest at least several hundreds of years of separate evolution in the human population and, therefore, the long-term existence of the different TEN strains in the Cuban or related population. On the other hand, several studies presented higher treponemal mutation rates, e.g., 3,02 x 10 −7 [7] or 6,6 x 10 −7 [40] per nucleotide site per year, which would suggest several decades of separate evolution considering the lowest nucleotide difference of 22.8 nt.
Altogether, the findings presented in this study suggest that there are several different sequence types of TEN strains circulating in the population of Cuba. Evidence suggests that they are being transferred mainly through sexual transmission since all source patients were suspected of having syphilis, and most of them had ulcerations in the genital area. Unlike the established concept of treponemal subspecies and corresponding diseases, this study points to the fact that at least in the early stages of the disease, both bejel and syphilis treponemes produce symptoms that are indistinguishable [2,8,16].
The authors recognize that the main limitation of the study was the low number of patients with high-quality samples suitable for sequencing; better samples would have allowed a more robust sequence analysis.
Although, we cannot fully exclude parallel recent TEN introductions to Cuba from an endemic area(s), the non-clonal character of the Cuban TEN isolates suggests that this bacterium cause persistent infections within the Cuban human population.
Supporting information S1