Comprehensive analysis of genomic variation, pan-genome and biosynthetic potential of Corynebacterium glutamicum strains

Corynebacterium glutamicum is a non-pathogenic species of the Corynebacteriaceae family. It has been broadly used in industrial biotechnology for the production of valuable products. Although it is widely accepted at the industrial level, knowledge about the genomic diversity of its strains is limited. Here, we investigated the comparative genomic features and pan-genomic characteristics of the strains. We also examined phylogenetic relationships among the strains based on average nucleotide identity (ANI). We found diversity between strains at the genomic and pan-genomic levels. Less than one-third of the C. glutamicum pan-genome consists of core and soft-core genes, whereas strain-specific genes account for about half of the total pan-genome. In addition, the C. glutamicum pan-genome is open and expanding, which indicates that new gene families may continue to be added to it. We also investigated the distribution of biosynthetic gene clusters (BGCs) among the strains and discovered slight variations of BGCs at the strain level. Several BGCs with the potential to express novel bioactive secondary metabolites were identified. Therefore, given the characteristic advantages of C. glutamicum, different strains can be potential candidates for natural drug discovery.


Introduction
Corynebacterium glutamicum is a gram-positive, non-sporulating, non-pathogenic, and generally recognized as safe (GRAS) organism. It remains very robust against oscillations in oxygen and substrate supply during large-scale fermentations [1,2]. It has been one of the most used microorganisms in industrial fermentation for producing amino acids, such as lysine and glutamate, for decades [3,4]. C. glutamicum has undergone substantial modification to provide a wide range of beneficial products including chemicals, proteins, polymers, natural products, and biofuels [5][6][7][8]. Many studies of C. glutamicum have been published in the past decade [9], yet the genetic variations among the strains remain unexplored.
Whole genomes of closely related and geographically co-occurring microbial strains show enormous variation within species, resulting from allelic and gene content changes [10][11][12][13]. However, it is challenging to distinguish between two lineages that are thought to be the same species yet have significantly different gene contents using conventional taxonomic approaches [14][15][16]. Hence, a better understanding of the genomic characteristics of different C. glutamicum strains is required.
Genes for the production, control, and resistance of secondary metabolites are often grouped to create biosynthetic gene clusters (BGCs) in microbial genomes [17]. Analyses of microbial genome sequences with bioinformatics tools have reported that a single genome may include 20-80 distinct BGCs [18]. On the other hand, a microorganism may possess certain BGCs but not express them under laboratory conditions [19,20]. Research in this area will support the development of wet-lab methods for natural product (NP)-producing strains that have greater potential to produce new compounds [18]. In 2017, Yang and Yang conducted a comparative analysis of C. glutamicum genomes, providing insights into the genetic diversity and evolutionary relationships within this significant industrial bacterium [21]. The research also pinpointed crucial mutations associated with amino acid production in various genetically engineered strains. However, certain limitations and challenges persist. Specifically, the pan-genome analysis was conducted with a relatively limited number of strains, potentially not encompassing the entire spectrum of species diversity. Furthermore, the identification of BGCs remains incomplete, highlighting areas for further investigation. It should therefore be helpful to use functional genomic approaches to identify those unidentified BGCs at the genomic level, and the distribution of BGCs and the evolutionary connections among the C. glutamicum strains need to be explored. The primary aim of this study is to analyse pan-genomic variations within different strains and explore the distribution patterns of BGCs.

Whole genome comparison
Genomic datasets of C. glutamicum strains were collected from the National Center for Biotechnology Information (NCBI) database (https://www.ncbi.nlm.nih.gov/datasets, accessed on 30th May 2022). Initially, 65 complete genome sequences of C. glutamicum strains were retrieved in addition to the reference genome (in the NCBI database, C. glutamicum SCgG2 serves as the primary reference genome), all in FASTA format. The complete whole genome sequences of C. glutamicum were selected using the NCBI genome filter tool, with the assembly level set to "complete". The choice of genomes was guided by contemporary research emphasizing the pivotal role that high-quality genomes play in pangenome and genome mining analyses [22]. Consequently, this study excluded draft and scaffold level assemblies to ensure the integrity and reliability of the genomic data under examination. The use of complete genomes enhances the reliability and comprehensiveness of the study's findings, contributing to a more accurate understanding of the genetic diversity, functional capabilities, and evolution of C. glutamicum. Whole genome comparisons were then executed using OrthoANI v0.5.0 with default parameters, which uses an enhanced pairwise average nucleotide identity (ANI) algorithm [23]. After the comparison, we selected 30 complete genomes; the other 35 were discarded because of 100% similarity matches. The program was also used to clarify species boundaries and to assess diversity at the genetic level among the whole genomes (Table 1). In this way, redundancy was avoided and the genetic diversity of C. glutamicum was ensured.
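The redundancy-removal step can be illustrated with a minimal sketch. It assumes that pairwise ANI values are already available as percentages and applies a simple greedy rule: a genome is kept only if it is not a 100% match to a genome that has already been kept. The strain names, the values, and the function `dereplicate` are hypothetical and do not reproduce the selection procedure actually used in the study.

```python
def dereplicate(names, ani, threshold=100.0):
    """Greedy dereplication: keep a genome only if its ANI to every
    already-kept genome is below `threshold` (percent)."""
    kept = []
    for name in names:
        if all(ani[frozenset((name, k))] < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical pairwise ANI values (percent); not data from the study.
names = ["strain_A", "strain_B", "strain_C"]
ani = {
    frozenset(("strain_A", "strain_B")): 100.0,
    frozenset(("strain_A", "strain_C")): 98.2,
    frozenset(("strain_B", "strain_C")): 98.2,
}

selected = dereplicate(names, ani)
print(selected)  # strain_B is dropped as a 100% match to strain_A
```

With a greedy rule of this kind, the order of the input list decides which member of an identical pair is retained, so listing the reference genome first would guarantee that it is kept.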

Genome annotation
The process of locating and designating all the pertinent features on a genomic sequence is known as genome annotation [40]. Selected whole genome sequences were re-annotated using Prokka v1.14.6 with default parameters [41]. Prokka uses BLAST+ and identifies the best match of annotated proteins and candidate genes from various databases [41]. Prokka and FragGeneScan v1.31 were used with default parameters to identify the number of genes in each genome [42]. FragGeneScan uses a novel gene prediction technique that improves prediction of protein-coding regions in short reads by combining codon usage and sequencing error models in a Hidden Markov Model (HMM) [42].

Pan-genome analysis
Pan-genomic analysis was conducted utilizing Roary v3.11.2 (with default parameters), a robust computational tool specifically designed for such analyses. Roary classifies genes into distinct categories, including 'core genes', 'cloud genes', 'shell genes', and 'soft-core genes', employing a rigorous computational framework [43]. The Bacterial Pan-genome Analysis tool (BPGA v1.3) [44] was employed for the systematic classification of orthologous genes into core, accessory, and unique genomes. Subsequently, strains containing a relatively higher number of unique genes were subjected to annotation using the BLAST algorithm against the Clusters of Orthologous Genes (COG) database [45]. To gain in-depth insights into the functional aspects of these genes, further analyses were conducted utilizing the BLAST algorithm against both the COG and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases [46]. The estimation of the pan-genome and core genome was performed using the USEARCH v11.0.667 [47] program available in BPGA, employing a 50% sequence identity cut-off. The resulting data were then subjected to nonlinear fitting based on the model extrapolation of the pan-genome and core genome, ensuring a robust and comprehensive analysis of the bacterial genomic elements under investigation [44,48].
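The four gene categories can be illustrated with a short sketch that bins each gene family by the fraction of genomes carrying it. The cut-offs used here (core: at least 99% of genomes; soft-core: 95% to below 99%; shell: 15% to below 95%; cloud: below 15%) are assumed to correspond to the Roary defaults. The gene names and counts are hypothetical, and `classify_gene` is an illustrative helper rather than part of Roary.

```python
from collections import Counter

def classify_gene(n_present, n_genomes):
    """Assign a pan-genome category from the fraction of genomes
    in which a gene family is present."""
    frac = n_present / n_genomes
    if frac >= 0.99:
        return "core"
    if frac >= 0.95:
        return "soft-core"
    if frac >= 0.15:
        return "shell"
    return "cloud"

# Hypothetical presence counts across 30 genomes; not data from the study.
n_genomes = 30
presence = {"geneA": 30, "geneB": 29, "geneC": 12, "geneD": 1}

categories = {gene: classify_gene(n, n_genomes) for gene, n in presence.items()}
summary = Counter(categories.values())
print(categories)
print(summary)
```

With 30 genomes, a gene missing from a single strain falls at 96.7% presence and is therefore counted as soft-core rather than core under these cut-offs.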

Phylogeny
FastTree v2.1.11 (with default parameters), which uses the maximum-likelihood method with generalized time-reversible (GTR) models of nucleotide evolution, was used to generate the phylogenetic tree [49]. iTOL, an online platform, was used to visualize the phylogenetic tree [50].

Identification of BGCs
We used three platforms to predict BGCs, which can accurately predict microbial secondary metabolite encoding regions by using sophisticated computer model services [51]. These are antiSMASH 6 (https://antismash.secondarymetabolites.org/, accessed on 9th June 2022) [52], PRISM 4 (http://prism.adapsyn.com, accessed on 28th June 2022) [53] and BAGEL4 (http://bagel4.molgenrug.nl, accessed on 29th June 2022) [54]. BGC boundaries in this study were detected using antiSMASH 6, a computational tool that employs several techniques. Firstly, antiSMASH determines BGC boundaries based on the physical distance to core domains within the analyzed sequences [55]. It utilizes ClusterCompare output, conducting a search of all gene products against a database comprising highly conserved enzyme Hidden Markov Model (HMM) profiles indicative of specific BGC types [56]. The tool applies pre-defined cluster rules to identify individual protoclusters encoded in the genomic region.
To standardize gene locations, antiSMASH employs a reference genome as a common coordinate system, allowing for the normalization of gene positions. Additionally, antiSMASH maps genomes of other strains containing the same or similar BGCs to the reference genome through alignment tools. This enables the identification and comparison of genomic regions corresponding to the BGCs across different strains in relation to the reference genome [52]. PRISM 4 predicts BGCs by analysing open reading frames from various databases [53]. BAGEL4 identifies ribosomally synthesized and post-translationally modified peptides (RiPPs) and bacteriocins. It discovers gene clusters by using a peptide database and/or HMM motifs that are present in relevant contextual genes, augmented with literature references and links to UniProt and NCBI [54].

Genomic analysis and single nucleotide polymorphism identification
Genome comparisons among C. glutamicum strains were conducted using BLAST Ring Image Generator (BRIG-0.95-dist) with default settings. BRIG facilitates the assessment of genotypic distinctions within closely related prokaryotic organisms [57]. We also used the Mauve genome alignment system to analyze the C. glutamicum strains [58]. Throughout evolution, microbial genomes can experience substantial mutations, including rearrangements and lateral transfers, leading to notable differences in gene order and content among closely related organisms. Mauve was employed to identify these events, enabling comprehensive comparisons of multiple microbial genomes, even in the presence of high recombination rates. The Mauve system was configured with default settings, employing the default seed weight, full alignment, and iterative refinement.
Single nucleotide polymorphism (SNP) analysis was used to discern genetic variations within the strains of C. glutamicum. Variants among the strains were identified with Snippy v4.6.0 [59], using C. glutamicum SCgG2 as the reference sequence. Core SNPs were also predicted with the same Snippy tool. This approach allowed for a detailed exploration of genetic diversity and core variations within the C. glutamicum strains under investigation.
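Per-strain variant summaries of the kind reported for this analysis can be tallied from a tab-delimited variant table. The sketch below assumes a table with a `TYPE` column whose values name the variant class (for example `snp`, `ins`, `del`, `complex`), which is how Snippy's tabular output is understood to be laid out. The rows are hypothetical, and `count_variant_types` is an illustrative helper, not a Snippy function.

```python
import csv
import io
from collections import Counter

def count_variant_types(handle):
    """Count variants per class from a tab-delimited table
    that has a TYPE column."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["TYPE"] for row in reader)

# Hypothetical variant table; not data from the study.
example = (
    "CHROM\tPOS\tTYPE\tREF\tALT\n"
    "contig_1\t101\tsnp\tA\tG\n"
    "contig_1\t250\tsnp\tC\tT\n"
    "contig_1\t400\tdel\tGAT\tG\n"
    "contig_1\t512\tins\tT\tTCC\n"
    "contig_1\t730\tcomplex\tAC\tGT\n"
)

counts = count_variant_types(io.StringIO(example))
total = sum(counts.values())
print(dict(counts), total)
```

For a real run, the in-memory string would be replaced by an open file handle to the variant table of each strain.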

Identification of horizontal gene transfer
The prediction of horizontally transferred genes was carried out using HGTector v2.0b3 (with default settings) [60]. The analysis focused on identifying horizontal gene transfer (HGT) events within the C. glutamicum AJ1511 and C. glutamicum AR1 genomes. A search was conducted against the default remote database with stringent criteria, requiring both identity and coverage to be greater than 50%. All other parameters were left at their defaults to ensure comprehensive and accurate detection of potential HGT events in the studied strains.
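The identity and coverage criteria amount to a simple filter on homology-search hits. The sketch below is only an illustration of that filter: the hit records, the field names, and `passes_cutoffs` are hypothetical and are not part of HGTector.

```python
def passes_cutoffs(hit, min_identity=50.0, min_coverage=50.0):
    """Keep a hit only if identity and coverage both exceed the cut-offs."""
    return hit["identity"] > min_identity and hit["coverage"] > min_coverage

# Hypothetical hits (percent identity and percent coverage); not study data.
hits = [
    {"subject": "hit_1", "identity": 82.4, "coverage": 95.0},
    {"subject": "hit_2", "identity": 47.9, "coverage": 88.0},
    {"subject": "hit_3", "identity": 66.0, "coverage": 50.0},
]

retained = [h["subject"] for h in hits if passes_cutoffs(h)]
print(retained)  # only hit_1 exceeds both cut-offs
```

The comparison is strict, so a hit sitting exactly at 50% coverage is excluded, in line with the "greater than 50%" wording.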

Pathogenic and non-pathogenic properties and plasmid typing
The prediction of pathogenicity for the chosen strains was carried out using the PathogenFinder web tool [61]. This tool employs a predictive model that considers both the probability score and the resemblance to known pathogenic species in order to assess the likelihood of pathogenicity.
The plasmid sequences were obtained from the NCBI database. To ascertain the classification of plasmids, Plasmid Multi-Locus Sequence Typing (Plasmid MLST) was employed. Plasmid MLST is a molecular typing method that analyzes specific genetic markers across plasmids, providing insights into their type and lineage. This approach aids in categorizing plasmids based on their sequence diversity and assists in understanding the genetic variation and relationships among different plasmids.

Results
Demographic information about the strains used in this study is listed in Table 1. Among the strains, 11 were isolated from soil, and the others were isolated from air, mucus, or rotten onion, or were lab strains. Twenty strains were isolated from Asian countries, 2 strains from Germany, and 1 strain each from Portugal and the United States of America. The origins of the remaining strains are unknown.

Whole genome comparison
The degree of relatedness among the studied strains was determined by calculating ANI. ANI also clarifies whether genomes belong to the same species, using a cut-off value of ≥ 95% for the same species. Our studied genomes showed ANI values higher than 97%, confirming that all the strains belong to the same species (S1 Table). A heat-map generated from the ANI scores is shown in Fig 1. The heat-map contains five sub-groups, which can be regarded as five clades. The clades were extracted from the pairwise ANIs by using a hierarchical clustering algorithm with a cut-off value of 0.5. This means that strains with ANI values higher than 0.5 were grouped together in the same clade. The clades do not seem to correlate strongly with the source or geographic location of the genomes. For example, clade 1 contains strains from soil and mucus, isolated in China and the USA. Clade 2 contains strains from soil, isolated in Germany, China, South Korea, and Portugal. Clade 3 contains strains from soil, air and lab sources, isolated in China and Japan. Clade 4 contains strains from soil and rotten onion, isolated in China and South Korea. Clade 5 contains strains from soil, isolated in South Korea. Strains belonging to clade 1 (R, SCgG1, SCgG2) have large genomes, carry multiple copies of NAPAA biosynthetic gene clusters (BGCs), and also possess betalactone BGCs. These characteristics may contribute to their distinctiveness as outliers within the broader spectrum of analyzed genomes.
Fig 1B represents the ANI comparisons among various C. glutamicum strains isolated from different sources, including soil and non-soil environments. The ANI values shed light on the impact of isolation source on genetic similarity. In a more detailed ANI analysis, the strains isolated from soil environments, such as C. glutamicum XV, C. glutamicum ZL, and C. glutamicum YI, exhibited ANI values close to 98%, indicating a high genetic similarity. This suggests a common genetic background among these soil-isolated strains. On the other hand, when soil-isolated strains were compared with those from non-soil sources, the ANI values were lower, at around 97%. This difference underscores the genetic divergence between strains from soil and non-soil origins. Such divergence could potentially be attributed to environmental factors and selective pressures specific to these habitats, leading to genetic adaptations unique to each niche.
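One way to picture how clades can be derived from pairwise ANI values is the sketch below. It is a loose illustration under stated assumptions, not the procedure used in the study: the linkage method and the unit of the cut-off are not specified in the text, so the sketch assumes single linkage and treats the cut-off as a maximum ANI distance (100 minus ANI, in percentage points). Cutting a single-linkage dendrogram at a given height is equivalent to finding connected components among pairs whose distance does not exceed that height, which is what the code does. All strain names and values are hypothetical.

```python
def single_linkage_clades(names, ani, cutoff):
    """Group genomes whose ANI distance (100 - ANI) is linked, directly
    or through intermediates, by pairs at or below `cutoff`."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent pointers to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if 100.0 - ani[frozenset((a, b))] <= cutoff:
                parent[find(a)] = find(b)

    clades = {}
    for n in names:
        clades.setdefault(find(n), []).append(n)
    return sorted(clades.values())

# Hypothetical ANI values (percent); not data from the study.
names = ["s1", "s2", "s3", "s4"]
ani = {frozenset((a, b)): 97.5 for a in names for b in names if a < b}
ani[frozenset(("s1", "s2"))] = 99.8
ani[frozenset(("s3", "s4"))] = 99.7

clades = single_linkage_clades(names, ani, cutoff=0.5)
print(clades)
```

A different linkage rule (average or complete) or a different distance scale could give a different partition, so the clade membership reported in the study should be taken from Fig 1 rather than from this sketch.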

Comparative genomic features of C. glutamicum strains
The average genome size of C. glutamicum strains was 3.24 Mbp (ranging from 2.84 Mbp to … Mbp) (Fig 2A and S2 Table). Coding sequence (CDS) count was predicted with the … The average GC content was 54.15% among the genomes, and the approximate number of tRNA genes ranged from 57 to 65. Eighteen rRNA genes were predicted in 28 strains, whereas strains TCCC11822 and YI each had 15 rRNA genes (Fig 2B and S2 Table).
The gene count, as determined by Prokka, displayed a range of 2688 to 3281 genes, with a mean value of 3078 genes per genome. In contrast, gene predictions through FragGeneScan exhibited a range of 2778 to 3369 genes, yielding a mean of 3197 genes per genome. It is noteworthy that Prokka's predictions resulted in a comparatively lower gene count than those obtained via FragGeneScan (Fig 2C and S2 Table).
The high number of cloud genes exhibits significant variation and shows the 'open' nature of the C. glutamicum pan-genome (Fig 3B). The pan-genome of C. glutamicum was analysed using an empirical power law regression function based on the Allometric1 model ($f(x) = 3059.17\,x^{0.136303}$). The obtained exponent (0.136303) falls between 0 and 1, which indicates that the pan-genome grows more slowly than that of other bacteria (possibly due to slower genetic diversification) but will nonetheless grow indefinitely (Fig 3C). In the context of Heaps' law, an 'open' pan-genome suggests the presence of a substantial and indeterminate number of additional genes, with its size potentially increasing boundlessly as more strains are included in the analysis [62][63][64]. C. glutamicum strains TQ2223, ATCC 13032, and HA exhibit a relatively low GC content coupled with a notable abundance of accessory genes (883, 925, and 924, respectively). Among these strains, C. glutamicum SCgG2 displays the lowest GC content and concurrently possesses the highest number of accessory genes (…). The core genome is primarily associated with essential biological functions such as amino acid transport and metabolism; translation, ribosomal structure and biogenesis; transcription; carbohydrate transport and metabolism; inorganic ion transport and metabolism; and posttranslational modification, protein turnover, and chaperones. At the same time, the number of unique genes within C. glutamicum genomes varied significantly, indicating individual differences and a relatively high level of genomic diversity. This variability suggests potential adaptation to diverse and extreme environments. Furthermore, KEGG pathway analysis revealed that these unique genes are involved in various biological processes related to metabolism, environmental information processing, and cellular processes.
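The power-law relationship can be made concrete with a short sketch that evaluates the reported function f(x) = 3059.17 x^0.136303 and recovers its parameters from a curve. The study used nonlinear fitting in BPGA; the sketch substitutes an ordinary least-squares fit on log-transformed values, which is a simpler stand-in and not the same procedure. The input curve is generated from the reported parameters purely for illustration, and the function names are hypothetical.

```python
import math

def pan_genome_size(n, k=3059.17, gamma=0.136303):
    """Power-law pan-genome size after n genomes: f(n) = k * n**gamma."""
    return k * n ** gamma

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(k) + gamma * log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    gamma = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    k = math.exp(my - gamma * mx)
    return k, gamma

# Illustrative curve generated from the reported parameters (not raw data).
genomes = list(range(1, 31))
sizes = [pan_genome_size(n) for n in genomes]

k_fit, gamma_fit = fit_power_law(genomes, sizes)
is_open = 0 < gamma_fit < 1          # criterion applied in the text
gain_31st = pan_genome_size(31) - pan_genome_size(30)
print(round(k_fit, 2), round(gamma_fit, 6), is_open, gain_31st > 0)
```

Because the exponent lies between 0 and 1, each additional genome still adds new gene families, but the increment shrinks as more genomes are included.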
The number of BGCs shows a moderate positive correlation with genome size and with total gene count (R² = 0.349 and R² = 0.358, respectively) (

Genomic and SNP analysis
In this study, we employed BRIG-0.95 for comprehensive genome comparisons among various strains of C. glutamicum. The reference genome, C. glutamicum SCgG2, was utilized as a baseline for these comparisons. A substantial portion of the genes present in SCgG2 were shared by the other strains, indicating a core genomic similarity among these strains. However, a detailed examination of the genomic alignments revealed significant disparities between SCgG2 and other strains, as denoted by white gaps in Fig 9. These gaps signify regions where genes were absent in certain strains, indicating potential genetic variations. Such discrepancies could be attributed to the integration of mobile genetic elements, horizontal gene transfer events, or recombination phenomena. These mechanisms are known to drive genetic diversification in bacterial populations, leading to the acquisition or loss of specific genes over evolutionary time. The identification of these genomic variances underscores the dynamic nature of C. glutamicum genomes and highlights the genomic plasticity within this bacterial species.
Fig 10 illustrates the output from pairwise whole-genome Mauve alignments, confirming the presence of significant structural variations among the genomes of the analysed strains. In each comparison, matching coloured blocks and connecting lines delineate homologous genome sections between the compared pairs. Strains TCCC11822, TQ2223, BCA, CR101, HA, and ATCC 21573 exhibited the most significant variations, indicating diverse genomic structures within these strains. These visual cues provide insights into the shared genomic regions and structural differences between the analysed strains. SNP analysis provided further insights into the genetic diversity of the C. glutamicum strains. Table 5 presents the genetic variants identified among the strains. C. glutamicum USDA-ARS-US-MARC-56828 exhibited the highest number of total variants (41270), characterized by substantial counts of complex variants (7908) and SNPs (32960). This strain displayed a significant divergence compared to the others. Conversely, strains like C. glutamicum SCgG1 showed minimal variants, with only 28 total variants. Several strains, such as C. glutamicum R, displayed a relatively low total variant count (20433) and a notable prevalence of deletions (141) and insertions (130). These findings underscore the genetic diversity within C. glutamicum strains, with certain strains exhibiting distinctive patterns of variation. The presence of unique SNPs in each strain suggests specific genomic changes, potentially influencing their functional attributes and ecological roles.
The phylogenetic tree based on core SNP analysis (Fig 11) delineates the evolutionary relationships among the C. glutamicum strains. The tree is rooted with the reference strain (SCgG2). Distinct clusters and branches denote genetic proximity. For instance, strains AJ1511, WM001, and TCCC11822 form a cluster, suggesting a shared genetic ancestry. Similarly, ZL-6 and ATCC 21799 exhibit close genetic relatedness. The tree also portrays a bifurcation between B253 and its cluster, including BE, ATCC 14067, and YI, reflecting their divergence. Further branching shows the genetic relationships among the remaining strains. Rooting the tree with the reference strain allows the genetic variations of the other strains to be compared against a common baseline. Overall, the phylogenetic tree provides a visual representation of the genetic distances and relationships among the C. glutamicum strains.

Horizontal gene transfer
Analysis with the HGTector tool revealed a substantial number of HGT events within the genomes of C. glutamicum strains. Specifically, in the AJ1511 strain, 684 distinct HGT events were identified from a dataset of 3014 predicted proteins. These events were predominantly sourced from Actinomycetes (71%) and, to a lesser extent, Micrococcales (21%). Similarly, in the AR1 strain, a total of 237 genes were predicted to have undergone HGT events out of 2759 proteins analysed. The majority of these events were attributed to Actinomycetes (73%), with a smaller fraction originating from Micrococcales (23%), as illustrated in Fig 12. Considering the prevalence of HGT events in AJ1511 and AR1, it is likely that other C. glutamicum strains would also reveal a mosaic of genetic origins. The genomic plasticity observed in these two strains is indicative of the adaptive strategies employed by C. glutamicum populations, emphasizing the role of HGT in shaping their genetic repertoire.

Pathogenicity, virulence properties and plasmid analysis
The investigation revealed that none of the C. glutamicum strains exhibited characteristics indicative of human pathogenicity. These findings are presented in detail in Table 6. This underscores the non-pathogenic nature of the examined C. glutamicum strains concerning human health. It is noteworthy that non-pathogenic bacteria lack the genetic elements associated with virulence, thereby affirming their incapacity to induce infections or diseases in humans. The plasmid analysis across different strains of C. glutamicum revealed diverse characteristics. Strains CP, XV, B253, USDA-ARS-USMARC-56828, AR1, ATCC_21831, and ATCC_13869 were found to harbor IncA/C type plasmids, with varying lengths and GC content (Table 7). Strains R and CGMCC1.15647 exhibited distinct plasmid types, namely IncI1 and IncHI1, respectively, and displayed substantial variations in plasmid sizes. The gene content of these plasmids varied among strains, encompassing differences in coding sequences (CDSs), pseudogenes, CRISPR arrays, rRNAs, tRNAs, ncRNAs, and frameshifted genes. Among the strains analyzed, 10 were reported to carry single plasmids, while C. glutamicum CGMCC1.15647 was unique in carrying two plasmids. The prevalent IncA/C type plasmid, found in the majority of strains, is known for its role in modulating changes to bacterial host chromosomes. In contrast, C. glutamicum R carries an IncI1 type plasmid, responsible for encoding sex pili in bacteria. The IncHI1 type plasmid is associated with antibiotic resistance. This analysis underscores the diversity and functional significance of plasmids in C. glutamicum strains.

Discussion
Whole genome comparison by ANI calculation revealed a high degree of relatedness between C. glutamicum strains. ANI scores higher than 97% verify that our studied genomes belong to the same species and are closely related. ANI comparison between Corynebacterium cystitidis strains showed a 95.1% score when they were isolated from different hosts but a >99% score when they were isolated from the same host [65]. Our demographic data also support a >97% score, since most of our strains are from soil sources. The average genome size (3.24 Mbp) of the strains was slightly higher than that of non-pathogenic C. casei LMG S-19264 (3.11 Mbp) and C. efficiens YS-314 (3.15 Mbp) [66]. Moreover, the average number of genes (3197) was also higher than in C. casei LMG S-19264 (2872) and C. efficiens YS-314 (3064) [66]. On the other hand, the average GC content was lower among C. glutamicum strains (54.15%) than in the other non-pathogenic species C. variabile DSM 44702 (76.1%) and C. efficiens YS-314 (69.93%) [67]. We found variation in tRNA coding genes among the C. glutamicum strains, with the number of tRNA genes varying from 57 to 65, whereas C. variabile DSM 44702 and C. efficiens YS-314 contain 59 and 56 tRNA genes, respectively [67]. Additionally, C. glutamicum strains possess more rRNA genes (15-18 rRNA genes) than the non-pathogenic Brevibacterium aurantiacum strains and Brevibacterium linens ATCC 19391 (12 rRNA genes) [68]. In addition, the average CDS count among the strains was 3007, comparatively higher than that of C. efficiens YS-314 (2950 CDS) [69].
Pan-genome studies of Corynebacterium at the genus level showed a very low number of core genes [66,67]. Analysis of 51 strains of various pathogenic and non-pathogenic species of the Corynebacterium genus showed 8.69% core genes [66]. Similarly, a study of eleven Corynebacterium species showed 6.68% core genes [67]. In contrast to the genus level, we found 29.1% core genes at the sub-species level among C. glutamicum strains, which is somewhat higher than the proportion of C. pseudotuberculosis core genes (26.1%) at the sub-species level [70]. The number of cloud genes (strain-specific genes) was considerably large and covered 47.78% of the pan-genome, similar to C. pseudotuberculosis cloud genes (42.34%) [70]. The low percentage of core genes in C. glutamicum likely results from a combination of factors such as horizontal gene transfer, adaptation to diverse environments, evolutionary divergence, and specialization. From an evolutionary perspective, this genetic diversity contributes to the species' ability to adapt, survive, and thrive in different ecological niches, and it strongly demonstrates the diversity among the strains. Large accessory genomes and a high number of strain-specific genes are frequently linked to horizontal gene transfer (HGT) in microorganisms [71]. In addition, we found that the GC content of C. glutamicum strains is low compared with other non-pathogenic species of the Corynebacterium genus. Our study also suggests a clear inverse relation between the abundance of accessory genes and the genomic GC content: as the GC percentage increases, the number of accessory genes decreases notably. This finding supports the idea of a possible relation of low GC content with horizontal gene transfer and codon reassignment in C. glutamicum [72][73][74][75].
Our study shows the open nature of the C. glutamicum pan-genome, which indicates that new gene families will continue to be added to the pan-genome. An open pan-genome of Corynebacterium at the genus level was also reported by the pan-genomic analysis of 40 strains of eleven different Corynebacterium species [76]. Thus, the pan-genome of C. glutamicum indicates the diversity of the gene pool and the likelihood of an increasing gene number.
Another objective of our study was to uncover the diversity and distribution of BGCs among the strains. Although the metabolic products of these BGCs remain undocumented, predictions based on bioinformatics revealed that several of them might encode products with unique structures [77][78][79]. Thus, our computational approach predicts BGCs as a screening step for new bioactive compound production, which can then be applied in wet laboratories.
NAPAA gene clusters of nonribosomal peptide synthetases (NRPs) and T1PKS gene clusters of polyketide synthases (PKSs) were found in all the studied strains. Additionally, terpene BGCs were found in 96.67% of strains. T1PKS, terpene, NAPAA and other NRPs were also most common in Gordonia hongkongensis EUFUS-Z298 [80], Burkholderia spp. [18], in the activated sludge microbiome [81], and in Ktedonobacteria [82]. NAPAA, particularly ε-poly-lysine, demonstrate notable antimicrobial efficacy and are widely used in the food and pharmaceutical sectors. T1PKS, in turn, can biosynthesize polyketides with antibiotic and antitumor properties. Terpenoids exhibit robust and specific biological activities, notably against diseases such as cancer and malaria. The consistency of a limited number of BGCs among closely related bacterial populations was previously reported [83]. This indicates that 'fixation' of BGCs can occur under strong positive selection, with the activity of the encoded products supporting survival in a specific environment [17]. The novel BGCs identified from the strains used for analysis include betalactone and lanthipeptide class IV BGCs. Betalactone BGCs were predicted in strains R, B253, SCgG1, and SCgG2, while lanthipeptide class IV BGCs were only found in strain B253. Betalactones manifest noteworthy bioactivity against bacteria, fungi, and cancer cell lines. Lanthipeptides, belonging to the subclass of ribosomally-synthesized and posttranslationally-modified peptides (RiPPs), generally display feeble antibacterial activities, with lanthipeptide class IV standing out as a noteworthy example. A study of Bacillus cereus strains identified different lanthipeptide classes and concluded that several lanthipeptide classes can evolve independently and that most of the lanthipeptide BGCs can originate from intra-species horizontal gene transfer [84].
Additionally, PKS and NRPs BGCs, which were most common in our studied genomes, are considered representatives of two major classes of antibiotics [80]. Kalimantacin antibiotics with a strong antistaphylococcal effect, from Alcaligenes species YL-02632S [85,86], and the antibiotic batumin from Pseudomonas batumici have been produced utilizing these BGCs [87]. C. glutamicum is suitable for T1PKS and NRPs synthesis by heterologous expression since it possesses an endogenous 4'-phosphopantetheinyl transferase (PPTase), PptAcg [88]. Roseoflavin, a broad-spectrum antibiotic, has already been produced using C. glutamicum via the heterologous expression of its BGCs [89]. We also found bacteriocin gene clusters in 12 strains of C. glutamicum. Bacteriocins have been seen as a feasible alternative to traditional antibiotics because of their distinct antibacterial mechanisms. They can also be used as innovative carrier molecules [90] and as plant growth-promoting, antiviral, and anti-cancer agents [91].
Whole genome comparison based on ANI scores also revealed the phylogenetic relationships among the strains. We divided all 30 strains into five clades: clade 1 with five strains, clade 2 with seven strains, clade 3 with eight strains, clade 4 with four strains, and clade 5 with six strains. We observed diversity of the BGCs within clade 1, clade 2, and clade 4, whereas members of clade 3 and clade 5 contain the same number of BGCs, although these two clades harbour different BGCs. We observed similar BGCs among the soil-isolated strains CICC10064, B414, and TQ2223. Similarly, the soil-isolated strains XV, ZL-6, YI, TCCC11822, ATCC 13869, and WM001 have similar BGCs, with strain YI having gained an extra NAPAA class. The soil-isolated strains SCgG1 and SCgG2 share similar BGC classes, including betalactone. On the other hand, strain C1, an engineered derivative of ATCC 13032, has lost two terpene BGCs.
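As an illustration of how clades can be derived from ANI scores, the following minimal sketch applies the settings described for Fig 1A (average linkage, Euclidean distance, distance threshold of 0.5). It assumes that each strain is represented by its row of pairwise ANI values; the strain names and ANI values are invented for illustration and are not data from this study, and the function names are hypothetical.

```python
import math
from itertools import combinations

# Hypothetical ANI profiles: each strain's row of pairwise ANI values (%)
# against all four strains. Names and numbers are invented for illustration.
ani_profiles = {
    "strain_A": [100.0, 99.9, 98.0, 98.1],
    "strain_B": [99.9, 100.0, 98.1, 98.0],
    "strain_C": [98.0, 98.1, 100.0, 99.8],
    "strain_D": [98.1, 98.0, 99.8, 100.0],
}


def euclidean(a, b):
    """Straight-line distance between two ANI profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean distance over all pairs of strains drawn from two clusters."""
    total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))


def cluster(profiles, threshold):
    """Merge the closest pair of clusters until the closest pair is
    farther apart than the threshold."""
    clusters = [[name] for name in profiles]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(clusters[p[0]], clusters[p[1]], profiles),
        )
        if average_linkage(clusters[i], clusters[j], profiles) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


clades = cluster(ani_profiles, threshold=0.5)
print(clades)
```

With these invented values, strains A and B form one group and strains C and D another, because the average distance between the two groups exceeds the threshold. The same procedure applied to a 30 x 30 ANI matrix would yield however many clades the chosen threshold supports.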
Additionally, we identified NAPAA-betalactone hybrid BGCs in strains B253, R, SCgG1 and SCgG2. Hybrid BGCs encode multiple scaffold-synthesizing enzymes [92,93]. Hybrid BGCs are common in some bacteria (98% occurrence in Streptomyces) [94], yet their exact roles are not completely known [95,96]. Notably, the locations of these hybrid BGCs within the genomes of these strains vary, as illustrated in Fig 8. This disparity implies that the hybrid BGCs might have been acquired or rearranged through horizontal gene transfer or recombination events, thereby contributing to genomic diversity across the strains. Consequently, our identification of hybrid BGCs rests on their gene content and functional characteristics rather than on their precise physical placement within the genomes.
We found that the number of BGCs is positively correlated with the genome size and the gene number of the strains. Strains SCgG1, SCgG2, BE, YI, 14067, and R, which have larger genomes and high gene numbers, each harbour 5 BGCs. However, strain CGMCC1.15647, which has the highest gene number and the largest genome size, contains only 4 BGCs. Overall, our correlation and regression analysis shows that as the genome size and the gene number increase, the number of BGCs is more likely to increase. Generally, strains with larger genomes tend to exhibit a higher number of BGCs, a phenomenon attributed to the potential accumulation of accessory genes and genomic islands carrying BGCs [97].
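A correlation and regression of this kind can be sketched as follows. This is a minimal illustration using Pearson's correlation coefficient and an ordinary least-squares line, which may differ from the exact procedure used in the study; the genome sizes and BGC counts are placeholder values chosen for illustration, not measurements from the analysed strains.

```python
import math

# Placeholder values for illustration only (not data from this study).
# The last entry mimics a strain with the largest genome but fewer BGCs.
genome_size_mbp = [3.1, 3.2, 3.3, 3.4, 3.5]
bgc_count = [3, 4, 4, 5, 4]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def least_squares(xs, ys):
    """Slope and intercept of the ordinary least-squares line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


r = pearson_r(genome_size_mbp, bgc_count)
slope, intercept = least_squares(genome_size_mbp, bgc_count)
print(f"r = {r:.3f}, slope = {slope:.2f} BGCs per Mbp")
```

A positive r and a positive slope indicate that BGC number tends to rise with genome size, while a single outlier (such as the largest genome carrying fewer BGCs) lowers r without reversing the trend.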
The potential presence of sequencing errors within publicly available databases remains a notable concern. To address it, we considered only complete genome sequences of C. glutamicum strains, which enhances the reliability and comprehensiveness of the findings. Prokka and FragGeneScan, both widely used and validated tools for prokaryotic genomes, were employed for genome annotation and gene prediction. For robust genome comparison and species delineation, we used OrthoANI, a pairwise average nucleotide identity (ANI) algorithm that is more robust and accurate than traditional methods. The pan-genome analysis employed Roary, BPGA, and USEARCH, which apply rigorous computational frameworks and sequence identity cut-offs for gene classification and estimation. Even with these precautions, our investigation revealed discernible diversity across various genomic features among the strains, along with variations in the abundance of biosynthetic gene clusters (BGCs) within their genomes.
Virulence genes are pivotal elements that contribute to the pathogenicity of microorganisms, enabling them to induce disease. In contrast, BGCs typically encode the enzymes and proteins involved in synthesizing specific secondary metabolites; the BGC classes found here include T1PKS, terpene, NAPAA, betalactone, and lanthipeptide. The connection between BGCs and virulence is diverse, as certain secondary metabolites produced by BGCs can influence the virulence of microorganisms. In our investigation, however, no identifiable secondary metabolites produced by BGCs were associated with virulence. Notably, all examined strains were found to be non-pathogenic, which suggests that the BGCs of these strains may not harbour virulence genes and further emphasizes the safety profile of these microorganisms in the context of human health.
While our study identified numerous distinct polymorphic sites among the strains under investigation, it has a limitation: the interaction or overlap between these polymorphic sites and the BGCs in C. glutamicum was not explored within the scope of our research. This unexplored aspect represents a promising avenue for future investigation.
In all, strains of C. glutamicum can be good candidates for engineering to produce various novel compounds through BGC expression. The strains may also have the potential to produce antibiotics and plant growth-promoting, antiviral, and anti-cancer agents.

Conclusions
The objectives of this study were to elucidate the genetic variation, pan-genomic characteristics, and distribution of BGCs among 30 strains of C. glutamicum. We observed genetic variation and diversity in the distribution of BGCs. The pan-genomic study of C. glutamicum strains revealed diversity at the sub-species level. We found a large number of strain-specific genes and an open C. glutamicum pan-genome. This study has yielded valuable insights into previously unexplored biosynthetic gene clusters (BGCs) that play a role in the production of betalactones, lanthipeptides, and NAPAA-betalactone hybrids. Thus, we conclude that various strains of C. glutamicum should be a focus for natural drug discovery at the industrial level.

Fig 1. (A) ANI-based whole genome comparison of C. glutamicum strains. The linkage method was average linkage, which calculates the average distance between all pairs of points in two clusters. The distance metric was Euclidean distance, which measures the straight-line distance between two points in a multidimensional space. The distance threshold was 0.5, which means that clusters with a distance less than or equal to 0.5 were merged together. This resulted in five clades, as shown by the horizontal dashed line in the plot. (B) ANI comparisons conducted among strains isolated from soil environments and strains isolated from both soil and non-soil environments within C. glutamicum. https://doi.org/10.1371/journal.pone.0299588.g001

Fig 3D). In Fig 4A and 4B, the distribution of COG and KEGG categories for core, accessory, and unique genes is illustrated. Fig 4C displays the phylogenetic relationships among C. glutamicum strains based on core genes.

The diversity of BGCs among the strains, together with their phylogenetic relationships, is shown in five clades (Fig 7). Additionally, strains B253, R, SCgG1, and SCgG2 contain hybrid BGCs, in all four cases comprising NAPAA and betalactone. However, the locations of the NAPAA-betalactone hybrid BGCs differ between the genomes: 256,574-294,301 bp in strain B253, 334,064-369,207 bp in strain R, 319,462-354,607 bp in strain SCgG1, and 319,463-354,608 bp in strain SCgG2 (Fig 8).
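The reported coordinates can be compared directly with a short calculation. The sketch below takes the start and end positions quoted above and computes the length of each hybrid region, assuming the coordinates are 1-based and inclusive (an assumption, since the coordinate convention is not stated); the variable and function names are illustrative.

```python
# Start and end positions (bp) of the NAPAA-betalactone hybrid BGCs,
# as reported in the text for each strain.
hybrid_bgc_regions = {
    "B253": (256574, 294301),
    "R": (334064, 369207),
    "SCgG1": (319462, 354607),
    "SCgG2": (319463, 354608),
}


def region_length(start, end):
    """Length in bp, assuming 1-based inclusive coordinates (an assumption)."""
    return end - start + 1


lengths = {
    strain: region_length(start, end)
    for strain, (start, end) in hybrid_bgc_regions.items()
}

for strain, (start, end) in hybrid_bgc_regions.items():
    print(f"{strain}: starts at {start} bp, spans {lengths[strain]} bp")
```

Under this assumption the regions span roughly 35 to 38 kb, and the regions in SCgG1 and SCgG2 have equal length with start positions 1 bp apart.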

Fig 7. Major classes of BGCs in the genomes of C. glutamicum strains with phylogenetic distribution. These BGC classes are categorized into five clades, each delineated based on their specific biosynthetic gene content. https://doi.org/10.1371/journal.pone.0299588.g007

Fig 12. HGT events in C. glutamicum strains AJ1511 and AR1. (A) Scatter plot illustrating horizontally transferred genes in AJ1511 (coloured dots represent horizontally transferred genes and colourless dots represent native genes). (B) Distribution of donor organisms and the corresponding number of genes transferred in AJ1511. (C) Scatter plot showing horizontally transferred genes in AR1 (coloured dots represent horizontally transferred genes and colourless dots represent native genes). (D) Distribution of donor organisms and the corresponding number of genes transferred in AR1. https://doi.org/10.1371/journal.pone.0299588.g012

Table 2. Predicted terpene BGCs in different clades of C. glutamicum strains.
PRISM 4 identified four major classes of BGCs: polyketide, nonribosomal peptide, dehydratase, and class II/III confident bacteriocin. Polyketide and nonribosomal peptide BGCs were present in all strains, while dehydratase BGCs were found in 21 strains and class II/III confident bacteriocin BGCs in 12 strains of C. glutamicum (S5 Table).

The BGCs identified for each strain by the different online platforms are listed in

Table 4. Different hits of BGCs from different genome mining tools using C. glutamicum genomes.
Columns: Species name; Isolation source; Genome size in Mbp; AntiSMASH Hit; Prism Hit; BAGEL Hit
https://doi.org/10.1371/journal.pone.0299588.t004