Genome comparison of different Zymomonas mobilis strains provides insights on conservation of the evolution

Zymomonas mobilis has the special Entner-Doudoroff (ED) pathway and it has excellent industrial characteristics, including low cell mass formation, high-specific productivity,ethanol yield, notable ethanol tolerance and wide pH range, a relatively small genome size. In this study, the genome sequences of NRRL B-14023 and NRRL B-12526 were sequenced and compared with other strains to explore their evolutionary relationships and the genetic basis of Z. mobilis. The comparative genomic analyses revealed that the 8 strains share a conserved core chromosomal backbone. ZM4, NRRL B-12526, NRRL B-14023, NCIMB 11163 and NRRL B-1960 share 98% sequence identity across the whole genome sequences. Highly similar plasmids and CRISPR repeats were detected in these strains. A whole-genome phylogenetic tree of the 8 strains indicated that NRRL B-12526, NRRL B-14023 and ATCC 10988 had a close evolutionary relationship with the strain ZM4. Furthermore, strains ATCC29191 and ATCC29192 had distinctive CRISPR with a far distant relationship. The size of the pan-genome was 1945 genes, including 1428 core genes and 517 accessory genes. The genomes of Z. mobilis were highly conserved; particularly strains ZM4, NRRL B-12526, NRRL B-14023, NCIMB 11163 and NRRL B-1960 had a close genomic relationship. This comparative study of Z. mobilis presents a foundation for future functional analyses and applications.

opportunities for strain developments and fundamental research [12]. The first genome sequence of the Z. mobilis ZM4 was published in 2005 [10]. This strain is regarded as a model organism in Z. mobilis research and industrial applications [12,13]. Thus far, the complete genome sequences of 9 Z. mobilis strains have been reported, including two sequences for Z. mobilis NRRL B-14023 (CP4) [14][15][16][17][18][19]. In fact, the genomes of Z. mobilis strains NRRL B-12526 and NRRL B-14023 were sequenced in our lab. Draft genome sequences of two Z. mobilis strains (ATCC 31822 and ATCC 31823) have also been reported [20,21]. Thus, in this article, we chose 8 Z. mobilis strains with complete genome sequences to analyze and compare (Table 1).
Moreover, different srtains of subspecies have different characteristics. For example, NRRL B-14023 is the most aerotolerant, quickly growing, and ethanol-yielding Z. mobilis strain [18]. ATCC 29191 is superior to other Z. mobilis strains in levan production [15]. ATCC 29192 exhibits distinct traits compared to other strains, including low oxygen tolerance, increased nutritional requirements, inability to utilize sucrose, and low DNA hybridization relatedness [20]. Differences of these strains in physiological and fermentation ability may be related to its genome. Although the complete genome sequences of these strains had been published, the comparative genome analysis has not been reported. Comparative genomics analysis of the genomes of different strains simultaneously could identify similarities and differences among them, which could be helpful for a better understanding of the genetic relationships among strains. The results will provide insights on the evolutionary relationships of different Z. mobilis strains and provide guidance for strain engineering to improve ethanol production.
In this study, the genomes of Z. mobilis strains NRRL B-12526 and CP4 = NRRL B-14023 were sequenced in our lab, and comparative genomics was analyzed among 8 Z. mobilis stains. Our aim was to perform comparative genomics analyses on these sequence data to find evolutionary relationships in Z. mobilis.

Strains, genome sequencing, assembly, annotation and whole-genome comparison
The Z. mobilis strains of NRRL B-14023 and NRRL B-12526 were purchased from China Center of Industrial Culture Collection (CICC). Genomic DNA was extracted from strains using  (Table 1). After Sequencing completed, original file of fastq format was download from BGI company ftp server and assessed reads quality with FastQC (loaded in http://www.bioinformatics.bbsrc. ac.uk/projects/download.html). To obtain the best assembled results, we employed two assembly software programs to assemble the genome of Z.mobilis, including Edena V3 [24] and Velvet [25].These scaffolds were ordered relative to the genome of the Z. mobilis strain ZM4, using a combination of the Contiguator [26] and ACT 10.2.0 [27]. The Illumina reads were remapped to the scaffolds using Bowtie 0.12.7 [28] and visualized by Tablet 1.12.02.06 [29]. Annotation was primarily done by Prokka [30], which uses Prodigal 2.6 [31] to predict gene sequence similarity analysis. GO annotations were assigned by Blast2GO [32]. Mauve was applied for whole chromosome genome comparison and plasmid genome comparison [33]. A circular genomic map for the genomes of 8 Z. mobilis strains genomes were compared using the BLAST Ring Image Generator (BRIG) [34].

Phylogeny tree construction
To better understand the evolutionary relationships and genomic variations at the gene level, the phylogenetic relationship of the Z. mobilis strains were constructed based on the complete genome sequences using MEGA6 software [35].

Comparative and pan-genome analysis
A total of 8 Z. mobilis genome sequences and protein sequences were downloaded from NCBI (Table 1). Pan-genome analysis was performed on a larger dataset of these 8 Z. mobile genomes using the Gene Family method in the pan-genome analysis pipeline. All proteins were filtered with the criteria of 50% coverage and 50% identity, and ortholog clusters were generated using MCL software.
Core-genome and pan-genome calculations were performed as previously described by Liu [36]. Orthologous protein sequences among the eight Z. mobilis genomes were defined by OrthoMCL version 2.0 [37]. Briefly, estimations of core genes, new genes, and pan-genome size were performed using all-against-all BLASTp search within and between all genome pairs and all-versus-all WU-TBLASTN searches. Homologous clusters from OrthoMCL were compiled to identify shared and unique genes [36]. The core genes, new genes, and pan-genome size were calculated for each combination and then extrapolated using several functions to find a best fit from the mean number at each sampling point [36,38].

General genomic features and plasmids of Z. mobilis NRRL B-12526 and NRRL B-14023
NRRL B-12526 was composed of a circular chromosome of 1,998,163bp and 5 circular plasmids, pZM1252601 to pZM1252605, which were 33,915bp, 30,952bp, 37,058bp, 32,400 bp and 32,801bp, respectively. The entire genome contained 1,708 protein-coding genes, 51 tRNA genes, and 9 rRNA gene clusters. NRRL B-14023 contained a circular chromosome of 2,012,538 bp and 4 plasmids, pZM1402301 to pZM1402304, which were 33,915bp, 30,952bp, 37,058bp and 32,801bp, respectively. We identified that NRRL B-12526 and NRRL B-14023 genomes shared an average 99.5% identity at the nucleotide level. The G + C content for strains NRRL B-12526 and NRRL B-14023 were 50.8% and 50.7%, respectively. The characteristics of genomes and plasmids were shown in Table 1 and Table 2.

Comparison at the genomic level
Comparative genome analysis was performed on 8 Z. mobilis genomes to provide a picture of the genetic diversity within this species. All strains analyzed in this study, were shown in Table 1, which includes genome size, GC content, and number of plasmids. A circular genome map for each genome was constructed by using the BLAST Ring Image Generator [34]. A visual inspection the circular alignment of the genomes of Z. mobilis (Fig 1) revealed a relatively high sequence similarity; especially the region of 100-1000 kbp. It was identical in 6 isolated strains to the alignment reference genome of NRRL B-12526, except for Z. mobilis ATCC 29192. In these regions we found genes related to the Entner-Doudoroff (ED) pathway, the carbohydrate metabolic process, the nitrogen compound metabolic process and the biosynthetic process. These include amino acid biosynthesis, NAD biosynthesis, carbohydrate biosynthesis, fatty acid biosynthesis and coenzyme A biosynthesis.
The regions at 1000-2000 kbp demonstrated more than 96% identity with other Z. mobilis strains to the alignment reference genome of NRRL B-12526, with some variable chromosome. We observed a regions at 90-100kbp in the chromosome of NRRL B-12526 that were not apparently present in the genomes of other strains, except for NRRL B-14023 (Fig 1). The only genes in the region whose function could be predicted were lipoprotein transporter, deoxyribonucleoside diphosphate metabolic process, ATPase activity. The region of 1320-1330kbp and 1480-1520kbp were absent in other strains, too. The proteins differentially present in the 1320-1330kbp regions were involved in Arylesterase, cyanate transporter, CysJI operon and Shikimate dehydrogenase. While the proteins in regions 1480-1520kbp were glutamine amidotransferase and HTH-type transcriptional regulatory protein. In our lab, comparison of ethanol production was made among 4 Z. mobilis (ZM4, ATCC 29191, NRRL B-14023 and NRRL B-12526) in sweet potato media with different glucose concentrations. The results showed that Z. mobilis NRRL B-12526 could rapidly metabolize substrates of glucose into ethanol (unpublished data). The differences between strains in physiological and fermentation ability may be related to its distinctive gene expression. However, the presences or absences of these distinctive genes were not sufficient to explain their different ethanol fermentation efficiencies, and therefore further experiments are necessary to fully elucidate it.
Interestingly, we found a large gap in Z. mobilis ATCC 29291 between 1120-1140 kbp (Fig  1). In these regions, the proteins were absent in Z. mobilis ATCC 29291 compared to other strains. These proteins included putative endoribonuclease L-PSP, polar-differentiation response regulator divK, Unfoldase HslU, Hup, ATP synthase subunit alpha and beta and Ftype ATPase subunit delta and gamma. Thus we inferred that because these genes were absent, the ATCC 29191 genome was the smallest of the 8 strains (Table 1). Because these genes were absent, Z. mobilis ATCC 29291 was superior to other Z. mobilis strains in levan (polyfructan) production [15].
We found ZM4 had regions, such as 1980-2030kbp with higher GC contents that were not present in the whole genome sequences of other strains (S1 Fig). Chaperonin Cpn10 and GroEL, transcriptional regulator XRE family and GntR family were found in these regions, according to their location in the genome of the alignment reference genome ZM4.
There were some differences between the genome of ATCC29192 and the other stains. We found that more than 40 regions in genome sequence of strain ATCC 29192 were absent when compared to the other strains were absent (Fig 1). The more evident gaps(missing regions) were visible at positions 570-610, 680-710, 1190-1220, 1240-1280, 1380-1400, 1460-1490 and 1760-1770kbp (Fig 1). In these absent regions, ATCC29192 lacked transcription factors of the MarR family and the TetR family and a series of nitrogen fixation-associated proteins. The regions at positions 640-650, 1140-1150, 1170-1230, 1240-1280, 1460-1490 and 1760-1770kbp (S1 Fig) were uniquely present in ATCC29192. Among these Z. mobilis strains, only ATCC29192 was able to encode ferritin δ chain and a specific nitrogen reductase. It is possible that the distinctive genes of these regions caused ATCC 29192 to exhibit distinct traits, such as low oxygen tolerance and increase nutritional requirements [20].
To detect chromosomal rearrangements, deletions, and duplications among strains of Z. mobilis, the alignment of the genomes of all strains were analyze using MAUVE (Fig 2). As showing in Fig 2,there was considerable conservation of the 8 genomes was revealed, although some serotype-specific regions were observed (Fig 2). However, the position of conserved regions in the ZM4 genome was rotated 180 degrees compared to other strains (Fig 2). Perhaps, the genome of ZM4 maybe had rearrangements occur during evolution. In addition, since the genome sequences were obtained by next-generation" sequencing, which was performed on the Illumina HiSeq 2000 platform. Thus the difference between the genome was likely to be related to the assembly method. However, the ZM4 genome used for the reference genome was still appropriate and convincing [12,13]. Two subtype-specific insertions were observed: the type 1-specific 1500-1600kbp (NCIMB 11163) insertion and type 2-specific 1300-1400 kbp (ATCC 29192 and ATCC 29191) insertion and inversions (Fig 2). The region of 800-1000 kbp in ATCC 29191 was shifted compared to the other strains. While there were a few structural rearrangements and shifts of the corresponding chromosomes in some strains, the changes did not appear to affect protein-coding genes.

Comparison of Z. mobilis-harboring plasmids
Plasmids are an important ways to obtain foreign genetic components. Analysis of the genome sequences of Z. mobilis strains showed that there were significant differences in the types and numbers of plasmids in the types and numbers, as well as the size and number of plasmids between the different Z. mobilis strains ( Table 2)  identical to the Z. mobilis NCIMB 11163 plasmid pZMO1A, which the complete genome sequences were most closely related between these strains [16].The sequences of plasmids pZZM401, pZM1252603 pZM1402303,pAZ1001,pZMOB04 and pZMOB05 were relatively similar. However, the plasmid sequences of Z. mobilis ATCC 29192 were different than the other strains. For example, plasmid pZYMOP01 of ATCC 29192 carries a CRISPR repeat region. From these results, we found a high level of homology between the complete genome of all sequences, and we found the plasmids sequences of Z. mobilis-harboring plasmids were also conserve.

Phylogenetic comparisons of whole genomes and plasmids among Z. mobilis strains
A phylogenetic tree of 8 the sequenced Z. mobilis strains was constructed based on the complete genome sequences using MMEA. The phylogenetic tree analysis indicated that NRRL B-12526 and NRRL B-14023, ZM4 and ATCC10988 gathered in a cluster, which suggests they could share a similar evolutionary path. ATCC29191 and ATCC29192 had a distant genetic relationship with the other strains. Particularly, the branch length (value 0.2525) of ATCC29192 was bigger than the others strains, which indicates that it is phylogenetically more distant (Fig 3). The phylogenetic relationships between the plasmids of all Z. mobilis strains were constructed (Fig 4). As expected, plasmids from the same cluster of strains showed close evolutionary relationships, such as plasmids pZZM403, pZM1402304, and pZM1252605. Although the strains ATCC29192 had a relationship distant from the other strains, the plasmid pZYMOP02 was clustered with plasmid pZZM402, pZM1402301 and pZM1252601.Additionally, pZYMOP01 was genetically closer to plasmid pZZM401, pZM1402303, and pZM1252603 (Fig 4).

Clustered regularly interspaced short palindromic repeats (CRISPRs)
Eight Z. mobilis strains had 3-5 types of CRIPSR repeats, CRISPR1, CRISPR2 and CRISPR3 were ubiquitous, but CRISPR4 and CRISPR5 were only present in the strains ATCC10988 and ATCC29191, respectively (Fig 5). The probability and types of phage infection may be different because of changing the living environment. Therefore, when the numbers and types of phage infection are increased, the number of CRISPR sites of in the Z. mobilis genome will be increased [41][42][43]. In the CRISPR-Cas systems of Z. mobilis strains, there were 13 different spacer profiles and 5 types repeats were identified. NRRL B-12526 and NRRL B-14023 shared the three same CRISPR repeat and spacers numbers (spacers 7, 4, and 5) (S1 Table), indicating that the strains invaded by the same phage or plasmid [43]. Interestingly, the CRISPR repeats (TTTCTAAGC TGCCTGTGCGGCAGTGAAC and GTTCACTGCCGCACAGGCAGCTTAGAAA) were common to all 8 Z. mobilis. However, the CRISPR repeats (CCAGAAATACTGCACTCGCTGTAATAGCCCC GATCTCTCAC) was only found in Z. mobilis ATCC10988; and the CRISPR repeat (CGGTTCA TCCCCGCGTGGGCGGGGAACAC) was only present in Z. mobilis ATCC29192.
Overall, there were a lot of common features among the CRISPR/Cas systems of Z. mobilis strains, which were conserved. However, the strains of ATCC29192, ATCC29191 and ATCC10988 acquired some new characteristics during evolutionary.

Core and pan-genome analysis of Z. mobili
Pan-genome refers to the sum of genes that can be found in a given bacterial species. This included the core genome (genes shared by all strains) and the dispensable genome (genes absent from one or more strains, and genes that are unique to each strain) [44][45][46]. The core genome encodes essential functions related to the basic biology of the species and genes shared by all strains [47]. The dispensable genome contributes to species' diversity and the genes that are absent from one or more strains, and the genes that are unique to each strain [47]. To understand the basic biology and population genetics of Z. mobilis, the core and dispensable genomes were analyzed by OrthoMCL version 2.0. The size of the pan-genome was 1945 genes shared by 8 strains. The core genome included 1428 genes, which account for 73.41% of CDSs of these strains. There were 517 accessory genes, including 507 dispensable genes and 10 unique genes. These accounted for 26.59% of the total CDSs from all 8 genomes. However, the distribution the core gene number in each strain varied considerably (Fig 5A). These core genes are related to carbohydrate metabolism, replication, transcription and translation. The extrapolated curve plateaus at a value of about 1900 with 8 genome sequences, and it can be inferred that the Z. mobilis had a closed pan-genome (Fig 5B). The genome structure of the strain of Z. mobilis was very conservative. There were 10 new genes found in different strains of Z. mobilis, 6 new genes in NCIMB 11163, 3 new genes in ATCC 29191 and 1 new gene in ATCC 29192 (Fig 5C). We found that these 1428 core genes were also subjected to COG functional classification (Fig 6). These 1428 core gene were more often associated with the metabolic process, catalytic binding, cellular process, transporters and biological regulation (Fig 6).

Conclusions
In summary, the comparative genomic analyses indicated that these strains shared a conserved core chromosomal backbone, and they shared quite high homology with an average of 96% sequence identity. ZM4, NRRL B-12526, NRRL B-14023, NRRL B-1960 and NCIMB 11163 shared an extremely short evolutionary relationship in the phylogenetic tree. Furthermore, strains ATCC10988, ATCC29191 and ATCC29192 had a far distant relationship. The size of the Z. mobilis pan-genome was 1945 genes, and it includes 1428 core genes and 517 accessory genes, which had a closed pan-genome. In the current study, we established a link between the sequenced complete genome sequences of Z. mobilis. We also found similarities and differences between the genomes of these strains, which indicated that Z. mobilis strains were conserved during evolutionary.