Comparative genome analysis of VSP-II and SNPs reveals heterogenic variation in contemporary strains of Vibrio cholerae O1 isolated from cholera patients in Kolkata, India

Cholera is an acute diarrheal disease and a major public health problem in many developing countries in Asia, Africa, and Latin America. Since the Bay of Bengal is considered the epicenter for the seventh cholera pandemic, it is important to understand the genetic dynamism of Vibrio cholerae from Kolkata, as a representative of the Bengal region. We analyzed whole genome sequence data of V. cholerae O1 isolated from cholera patients in Kolkata, India, from 2007 to 2014 and identified the heterogeneous genomic region in these strains. In addition, we carried out a phylogenetic analysis based on the whole genome single nucleotide polymorphisms to determine the genetic lineage of strains in Kolkata. This analysis revealed the heterogeneity of the Vibrio seventh pandemic island (VSP)-II in Kolkata strains. The ctxB genotype was also heterogeneous and was highly related to VSP-II types. In addition, phylogenetic analysis revealed the shifts in predominant strains in Kolkata. Two distinct lineages, 1 and 2, were found between 2007 and 2010. However, the proportion changed markedly in 2010 and lineage 2 strains were predominant thereafter. Lineage 2 can be divided into four sublineages, I, II, III and IV. The results of this study indicate that lineages 1 and 2-I were concurrently prevalent between 2007 and 2009, and lineage 2-III observed in 2010, followed by the predominance of lineage 2-IV in 2011 and continued until 2014. Our findings demonstrate that the epidemic of cholera in Kolkata was caused by several distinct strains that have been constantly changing within the genetic lineages of V. cholerae O1 in recent years.


Introduction
Cholera is an acute life-threatening diarrheal disease and remains a major health threat, particularly in developing countries in Asia, Africa, and Latin America [1,2]. It is estimated that 1.4 to 4.3 million cases of cholera and 28,000 to 142,000 deaths due to cholera, occur every year worldwide [3]. The Gram-negative bacterium Vibrio cholerae has more than 200 serogroups, but only O1 and O139 serogroups are responsible for epidemic and pandemic cholera [4,5]. Serogroup O1 has been further classified into two biotypes, classical and El Tor, based on several phenotypic traits. Each biotype has a unique nucleotide sequence within the genes encoding the cholera toxin responsible for severe diarrhea [4]. The cholera toxin is encoded on the lysogenic bacteriophage CTXF and consists of two subunits, A and B [6]. Corresponding amino acids at 20/39/68 of the cholera toxin B subunit of the classical biotype are H/H/T (ctxB1 genotype), and of the El Tor biotype are H/Y/I (ctxB3 genotype) [7]. Historically, seven cholera pandemics have been recorded since 1817. The sixth, and presumably earlier pandemics, emerged from the Bay of Bengal and were caused by the V. cholerae O1 classical biotype. However, the current seventh pandemic is caused by the El Tor biotype [5]. Recently, Hu et al. used comparative genomic analysis to demonstrate that the seventh pandemic strains originated from a nonpathogenic strain first observed in 1897 and slowly acquired virulence-associated elements by 1954 before becoming pandemic in 1961 [8]. Over the years, the V. cholerae O1 El Tor biotype has shown remarkable change and developed novel pathogenic variants that have the classical type ctxB gene (ctxB1) with an El Tor type genomic backbone [9][10][11][12][13][14][15]. Recently, the new ctxB variant (ctxB7) was found in Haiti, amongst other countries, which has N/H/T at the amino acid position 20/39/68 of the cholera toxin B subunit [16]. This newly appeared variant of V. cholerae has totally replaced the old strains indicating that the predominant strain has shifted during the current pandemic [16].
It has also been reported that the Haitian variant strain has evolved due to sequential events in the Indian subcontinent with some cryptic modification in the genome [17]. Mutreja et al. reported that the seventh pandemic strain first appeared in the Bay of Bengal and recurrently spread from this area to different parts of the world in at least three waves [18]. The Bay of Bengal is therefore considered the epicenter for the seventh cholera pandemic. It is important to monitor the appearance of new variants of V. cholerae in the Bay of Bengal, as it is possible that they will spread around the world in the future.
From the active diarrheal disease surveillance in the Infectious Diseases and Beliaghata General Hospital in Kolkata, it was established that V. cholerae O1 is one of the most common bacterial pathogens associated with diarrhea, with an estimated 11,000 cases every year [19]. Thus, cholera continues to be an important public health problem; hence, it is important to understand the genetic dynamism of V. cholerae from Kolkata, as a representative area of the Bengal region.
Vibrio Seventh Pandemic Island (VSP) was first detected by comparative genomic analysis of the classical and El Tor biotype strains of V. cholerae O1 [20]. Although the two genomic regions, VSP-I and VSP-II, were identified to be unique in the seventh pandemic El Tor strains, the role of these genomic islands in the pathogenicity of the organism is yet to be established. VSP-II is a 26.9-kbp genomic region composed of 24 genes between VC0490 and VC0516 according to the annotation of whole genome sequence (WGS) of V. cholerae N16961. These include genes encoding RNase, type IV pilin, chemotaxis, DNA repair, and transcriptional regulator [21]. Several variants of VSP-II have been reported in El Tor strains isolated from different continents, including Asia, Africa, and Latin America [22][23][24][25][26]. Therefore, characterization of VSP-II types is helpful in understanding the genetic lineages involved in the global transmission of cholera. In addition, WGS analysis is currently used as a powerful tool for understanding the various functional and evolutionary aspects of the organisms [27][28][29][30][31][32]. In this study, we carried out WGS analysis of V. cholerae O1 strains isolated between 2007 and 2014 from cholera patients in Kolkata in order to determine their genetic lineages. This analysis revealed the heterogeneity of VSP-II in Kolkata. In addition to the VSP-II genotype, phylogenetic analysis based on the whole genome single nucleotide polymorphisms (SNPs) revealed that shifts of predominant strains have occurred several times in recent years in Kolkata.

Heterogeneous VSP-II
To understand the genomic diversity of V. cholerae O1, we randomly selected 10 strains isolated from hospitalized cholera patients in each year between 2007 and 2014. Draft genome sequences of 80 strains were obtained using a next generation sequencer. Sufficient DNA sequencing reads were generated to cover the genome at least 60 folds in 79 strains. One strain was excluded from further analysis due to poor quality of the sequence data. The short sequence reads were mapped onto the genomic sequence of V. cholerae N16961 as an El Tor reference strain. We found that all strains lacked part of VSP-II, which is a 26.9-kbp genomic island consisting of 24 open reading frames (ORFs), VC0490 to VC0516 (Fig 1A). We determined the genetic organizations of VSP-II in these strains.
The region from 119th nucleotide of VC0495 to 1320 bp downstream of the VC0498 stop codon was replaced by a 1257 bp DNA fragment consisting of genes for transposase A and B subunits in 19 isolates (VSP-IIB, Fig 1B). Furthermore, 2 out of these 19 strains had an insertion of a 1767-bp DNA fragment of SXT IS4 family transposase gene at the 510th nucleotide of the VC0516 gene (VSP-IIBv1, S1 Fig) or at the 780th nucleotide of the VC0515 gene (VSP-IIBv2, S1 Fig).
Fifty-eight strains had a larger deletion than VSP-IIB. The 119th nucleotide of VC0495 to 596 bp downstream of the VC0512 stop codon (14,376 bp length) was replaced by genes for transposase A and B subunits (VSP-IIC, Fig 1C). Moreover, 9 of these strains had an insertion of the additional transposase gene fragment at the 560th nucleotide of VC0492 (VSP-IICv1, S1 Fig). The deleted regions are distinct in VSP-IIB and VSP-IIC; however, the upstream terminals of the replaced area (119th nucleotide of VC0495) were identical. These observations may suggest the existence of a hot spot for transposase. Two strains were negative for VSP-II (VSP-IID, Fig 1D). The VSP-II was integrated between two attachment sites, attL (14 bp) and attR (16 bp), in all the VSP-II positive strains ( Fig 1A). Murphy et al. demonstrated that VSP-II could be excised from the chromosome due to VC0516, which encodes an integrase, and the post-excision att site was of a shorter type (14 bp), identical to attL [32]. Classical strains O395 and V51, which are VSP-II-negative isolates, have a 16-bp att site identical to attR. However, VSP-II-negative strains in this study had an att site of 13 bp (Fig 1D). A shortened att site might be a consequence of excision of VSP-II from a VSP-II-positive El Tor strain.
Taken together, we identified three types of VSP-II in clinical V. cholerae O1 isolates in Kolkata (Fig 1). In addition, variants of VSP-IIC that have insertions of an additional transposase gene fragment in VC0492, and of VSP-IIB, that also has an insertion of SXT IS4 family transposase gene in VC0516 or VC0515, were identified (S1 Fig). The results also indicate that the predominant strain of V. cholerae O1 can rapidly shift in a particular area during an endemic, or that diverged strains were present in the environment of a particular area, which might have caused the seventh pandemic. ctxB variants V. cholerae O1 El Tor strains were characterized by several ctxB variants including ctxB3, ctxB1, and ctxB7 in the typical El Tor type, El Tor variant, and Haitian variant, respectively. Genotypes of ctxB and VSP-II of each strain used in this study are shown in S1 Table. Although strains were isolated from hospitalized patients with typical cholera symptoms, two  [36,37], mannose sensitive hemagglutination (MshA) and repeat in toxin (RtxA). Two ctxB-negative strains, as well as the ctxB-positive strain in current study, harbored genes encoding for Hly, MshA and RxtA, but not NAG-ST, structural proteins of T3SS and Chx. The other 77 strains harbored the ctxB gene, either ctxB1 (n = 20) or ctxB7 (n = 57) genotypes. As shown in S1 Table, 18 strains with ctxB1 had VSP-IIB and the other 2 strains had VSP-IID. All 57 strains with ctxB7 had VSP-IIC. Both VSP-II and CTXF prophage are mobile elements and the distance between these elements is more than 1 Mbp on the 2.96 Mbp on chromosome 1 in the reference genome. If the mutations in ctxB and VSP-II were independent events, either element may have affected the acquisition or selectivity of another element.

Phylogenetic analysis
To assess the genetic lineage of V. cholerae O1 clinical isolates in Kolkata, we performed phylogenetic analysis based on the genome-wide SNPs. As shown in Fig   strains with the ctxB1 allele, one Kolkata strain negative for ctxB, and other strains isolated in India and Nepal. In addition to 19 VSP-IIB strains, the two VSP-IID strains were also found in this lineage. This result is consistent with the notion that these VSP-IID strains are not of the classical biotype but instead seventh pandemic strains after the excision of VSP-II, as suggested by the short att site. Each VSP-IID strain is closely related to each other and with the VSP-IIB strains, suggesting that VSP-IID is a derivative of the VSP-IIB strain as a consequence of the excision of VSP-II. Lineage 2 contains 58 strains, of which 57 strains possesses ctxB7 and one negative for ctxB gene. Although all 58 lineage 2 strains harbored VSP-IIC, lineage 2 was divided into four sublineages, I, II, III, and IV (Fig 3). Lineages 2-I, 2-II, and 2-III comprised South Asian isolates; however, 2-IV also contained Haitian isolates. In this lineage, Kolkata strains isolated between 2010 and 2014 were more clustered among themselves owing to the relatedness between Nepalese and Haitian isolates (Fig 3). Strains with VSP-IICv1, which is a transposon-inserted variant of VSP-IIC, formed a cluster (Fig 3), suggesting clonal expansion of the lineage 2-IV with VSP-IICv1. Each of the two ctxB-negative strains was phylogenetically within lineages 1 and 2, therefore seeming to emerge independently.

Temporal analysis
Phylodynamic analysis of the 79 Kolkata strains was investigated via Bayesian analysis (Fig 4). Since the discovery of the prototypical VSP-   2014. CIRS101 isolated in Bangladesh in 2002 had a substitution from VC0495 to VC0512 by transposase, which is identical to VSP-IIC in this study [26]. Moreover, both VSP-IIB isolates and VSP-IIC isolates were found in Chandigarh, a province of northern India, in 2009 [22]. These strains could emerge and spread widely throughout the Indian subcontinent and, further work with retrospective analysis would be required to elucidate the emergence mechanism of VSP-II variants. It has been reported that V. cholerae strains in all pandemics disseminated from the Bay of Bengal to the rest of the world [18] and considering this tendency, it is possible that the new VSP-II variants could spread beyond this region.
Taviani  41,42]. Our work characterized the transition of predominant strains during several continuous years at the epicenter of cholera. Combining these studies with computational modeling may enable us to predict strains that cause epidemics throughout the world.

Bacterial strains
Vibrio cholerae O1 strains used in this study are listed in S1 Table with  The patients were aged between 1 month and 89 years at two hospitals in Kolkata, and all patients were discharged after treatment.

Whole-genome sequencing
Genomic DNA was prepared using the DNeasy Blood & Tissue Kit (Qiagen) according to the manufacturer's instructions. In total, 80 V. cholerae O1 strains were used for whole genome sequencing. We prepared Illumina libraries using Nextera XT DNA Library Preparation Kit (Illumina) and sequenced paired-end Illumina short reads for each library on HiSeq 2500 (Illumina) or MiSeq (Illumina) sequencers. The sequence reads were mapped with V. cholerae O1 El Tor reference strain N16961 by using CLC Genomics Workbench 8.5.1 (CLC bio) to obtain whole genome alignment. The reads were assembled using a de novo genome assembly program of CLC Genomics Workbench and a multicontig draft genome was generated for each sample. Nucleotide sequence data were submitted to the DDBJ Sequenced Read Archive and each accession number is listed in S1 Table. Identification of VSP-II Comparison of short sequence reads with the genomic sequence of V. cholerae N16961, a prototype strain of seventh pandemic El Tor biotype, suggested the lack of a large area in the VSP-II genomic island; thus, genetic organizations of VSP-II in each strain used in this study were identified (Fig 1). Among 79 strains, 19, 58, and 2 were found to lack the ORFs from VC0495 to VC0498, VC0495 to VC0512, and the entire VSP-II sequence, respectively. Strains lacking the entire VSP-II sequence generated a contig cover from upstream to downstream of VSP-II region (Fig 1D).
Nineteen strains lacked the internal region of VSP-II from VC0495 to VC0498 as compared with the N16961 sequence and did not generate a contig cover from upstream to downstream of the missing area. This area of each strain was PCR amplified and the sequence of this amplicon was determined using the Sanger method (Fig 1B). Among these 19 strains, 2 strains had truncated contigs within VSP-II region in addition to the lacking area. These contigs were also amplified and the sequence was determined (S1 Fig). The rest of the 58 strains were found to lack the ORFs from VC0495 to VC0512 as compared with the N16961 sequence. This area of each strain was PCR amplified and determined the sequence of a representative strain (Fig 1C). To confirm the identity of VSP-II, the entire VSP-II fragments of these strains were amplified and analyzed by Restriction Fragment Length Polymorphism (RFLP) using BglII. Nine strains showed altered band patterns compared with the others, and the difference of these VSP-II sequence were identified by PCR amplification and sequencing (S1 Fig).

SNP detection and phylogenetic analysis
To remove adapter sequences and low quality bases with a Phred score of less than 15 from the short reads, read trimming was performed using fastq-mcf (https://expressionanalysis.github. io/ea-utils/) and sickle (https://github.com/najoshi/sickle) program. Simulated paired-end reads were constructed from the available genomic sequences of V. cholerae strains using Sim-Seq software [43] with the following parameter: number of pairs of reads, ''read_number 2000000"; mean library insert size, ''insert_size 150"; and paired-end reads length of 120 mer, ''21 120 22 120". These parameters indicated that 4 million hypothetical 120-mer reads were generated without mutations or indels from the genomic sequences used for SNP identification. The trimmed or simulated reads with at least 40 mer were mapped using the BWA-mem program [44] against the N16961 sequence (NC_002505.1 and NC_002506.1) [45], and the read mapping data were constructed by the samtools program [46]. All SNPs with at least a 5× coverage depth and a Phred score of at least 20 were extracted using VarScan v2.3.4 [47]. The SNPs on the repeat and prophage regions of the N16961 genome, which was identified by the NUCmer [48] and PHAST [49] program, was excluded for further core-genome phylogeny analysis. Additionally, to exclude for the recombination regions, RecHMM was used to identify the recombination region [50]. The remaining SNPs were concatenated to generate a pseudo sequence for phylogenetic analysis; maximum likelihood phylogenetic analysis was performed using RAxML v8.2.0 [51] with 1,000 bootstrap iterations. The trees were visualized using iTOL 3 (http://itol.embl.de/) [52].

Bayesian phylogenetic inference
To estimate the divergence date of the V. cholerae O1 isolate in Kolkata, we performed a temporal analysis using the Bayesian Evolutionary Analysis Sampling Trees (BEAST) v.2.4.4 software package [53]. The isolation date of each strain was used as tip data. A random clock model was implemented using Markov Chain Monte Carlo (MCMC) chains run for 100 million generations with 10% burn-in and sampled every 1000 generations. A GTR nucleotide substitution model was used with a gamma distribution with four rate categories. The effective sample sizes were greater than 200 for all estimated parameters and tree data were summarized to generate the maximum clade credibility tree.

Ethics statement
This study was approved by the duly constituted Institutional Ethics Committee (IEC) of National Institute of Cholera and Enteric Diseases. As per the recommendation of IEC, individual written informed consent was obtained from each adult patient and a parent or guardian of child patient enrolled in this study and confidentiality was maintained.
Supporting information S1