Comparative genome analysis of Salmonella enterica serovar Gallinarum biovars Pullorum and Gallinarum decodes strain specific genes

Salmonella enterica serovar Gallinarum biovar Pullorum (bvP) and biovar Gallinarum (bvG) are the etiological agents of pullorum disease (PD) and fowl typhoid (FT) respectively, which cause huge economic losses to poultry industry especially in developing countries including India. Vaccination and biosecurity measures are currently being employed to control and reduce the S. Gallinarum infections. High endemicity, poor implementation of hygiene and lack of effective vaccines pose challenges in prevention and control of disease in intensively maintained poultry flocks. Comparative genome analysis unravels similarities and dissimilarities thus facilitating identification of genomic features that aids in pathogenesis, niche adaptation and in tracing of evolutionary history. The present investigation was carried out to assess the genotypic differences amongst S.enterica serovar Gallinarum strains including Indian strain S. Gallinarum Sal40 VTCCBAA614. The comparative genome analysis revealed an open pan-genome consisting of 5091 coding sequence (CDS) with 3270 CDS belonging to core-genome, 1254 CDS to dispensable genome and strain specific genes i.e. singletons ranging from 3 to 102 amongst the analyzed strains. Moreover, the investigated strains exhibited diversity in genomic features such as virulence factors, genomic islands, prophage regions, toxin-antitoxin cassettes, and acquired antimicrobial resistance genes. Core genome identified in the study can give important leads in the direction of design of rapid and reliable diagnostics, and vaccine design for effective infection control as well as eradication. Additionally, the identified genetic differences among the S. enterica serovar Gallinarum strains could be used for bacterial typing, structure based inhibitor development by future experimental investigations on the data generated.


Introduction
Pullorum disease (PD) and fowl typhoid (FT) are two distinct septicaemic diseases caused by non-motile Salmonella enterica subsp. enterica serovar Gallinarum (S. Gallinarum) biovar Pullorum (bvP) and biovar Gallinarum (bvG), respectively, which exhibit host-specificity towards poultry and aquatic birds [1][2][3]. The diseases caused by these invasive avian pathogens cause high morbidity and acute mortality in poultry in India and various countries of Asia, Africa and South America. Pullorum disease (PD) occurs in young birds and persists for long periods in spleen and reproductive tract which leads to high mortality [4]. The disease is characterised by white diarrhoea chaperoned with increased antimicrobial resistance (AMR) and high infection rates [5,6]. On the other hand, FT can affect birds of all ages but primarily occurs in adult birds and results in variable morbidity depending on the age, species, and breed of the bird [7]. Salmonella Enteritidis, a major food-borne pathogen causes infection which leads to enteritis in multiple hosts in addition to poultry [8]. Although S. Gallinarum has negligible importance in humans, S. Enteritidis is an important zoonoses [8,9]. The S. Gallinarum, in addition has been found to be a recently evolved ancestors of S. Enteritidis [10]. The virulence of Salmonella spp. is mediated by an arsenal of genes which are capable of invasion, replication, and colonization inside the host cells [11,12]. The genetic factors involved in pathogenesis of FT and PD at molecular and cellular mechanisms are still under elucidation [2,13]. The genomic sources of virulence systems in Salmonella enterica serovars are mainly divided into two elements, one of which is horizontally acquired chromosomally located mobile genetic elements known as Salmonella pathogenicity Islands (SPIs), and prophage elements, whereas other is plasmids associated with salmonellae [14]. These genetic elements have played a seminal role in various ecological niche adaptations in different host and pathogenicity life-style in Salmonella by encoding for proteins carrying out cell-adherence, cellular invasion at host level, induction of innate immune and/or pathophysiological responses to infection [15,16]. Two pathogenicity islands, Salmonella pathogenicity island 1 (SPI-1) and SPI-2 have been described, which play roles in mediating disease by Salmonella enterica through their respective type III secretion systems (TTSS). SPI which encode T6SSs have been detected in SPI6, SPI19, 20, 21 in S. Gallinarum and S. Enteritidis [17].
Additionally, prophage lysogeny is a rich contributor of Salmonella genome diversity, and their acquisition leads to enhanced virulence and pathogenicity [18,19]. The prophage elements are being increasingly utilised as molecular markers for strain discrimination [20,21]. Another crucial genetic element that is involved in persistence, virulence and AMR includes toxin-antitoxin system (TA system) [22][23][24]. Furthermore, the Salmonella serovars including the Gallinarum and Enteritidis are increasingly reporting incidence of antimicrobial resistance (AMR) [25,26]. The comparative genomic analysis of pathogens helps in elucidating the diversity and depth of functional set of genes in form of pan-genome and decoding of evolutionary history [27,28]. Moreover, it aids in functional annotation, as well as decodes molecular mechanisms underlying pathogenesis and niche adaptation by precise measurement of genetic variation within and between pathogenic groups [29][30][31].
Recently we determined the first draft genome sequence of an Indian strain of S. Gallinarum VTCCBAA614 isolated from diseased poultry [32]. The present investigation dealt with whole genome characterization and assessment of genotypic differences amongst bvG and bvP strains including in house S. Gallinarum Sal40 VTCCBAA614 strain ( Table 1). The study unravelled pan-genome, core-genome, dispensable genome, and strain specific genes of the analyzed S. Gallinarum strains. Moreover, the investigation decoded diversity in virulence factors, genomic islands, prophage regions, TA systems, and acquired AMR genes in the investigated strains via employment of various computational tools. The genotypic differences revealed by the study will serve as the genomic resource for the identification and discrimination of biovars, and could form the basis of structure based inhibitors design and that would be beneficial to poultry industry. Although comparative genomics analysis of host-adapted salmonellae genomes have elucidated various evolutionary and virulence themes [5,8], this is the first report for the assessment of genotypic differences, which includes an Indian strain among the investigated strains from America and China.

Data collection of Salmonella genomes
The complete genome sequences of eight S. Gallinarum strains i.e., S. Gallinarum str. 287/91 (NC_011274.1), S. Gallinarum str. 9184 (NZ_CP019035.1), S. Pullorum str. ATCC 9120 (NZ_CP012347.1), S. Pullorum str. S06004 (NC_021984.1), S. Pullorum QJ-2D-Sal (NZ_CP022963.1), S. Gallinarum/pullorum str. CDC1983-67 (NC_022221.1), and S. Gallinarum/pullorum str. RKS5078 (NC_016831.1) including draft genome sequence of S. Gallinarum Strain VTCCBAA614 (isolated from chicken in India from FT inflicted broiler flock) were mined from NCBI database to decode genotypic differences among the S. enterica serovar Gallinarum strains. In addition the genome sequence of S. Enteritidis str. P125109 (NC_011294.1) [10] a well studied pathogenic Salmonella strain was also retrieved from NCBI for employment as reference for comparative genome analysis (Table 1). Sequence quality of genome assemblies is dependent on the sequencing technology used, genome coverage and aim of the sequencing. Information regarding the investigated genomes such as sequencing technology used, genome coverage, as well as assembly method was mined from literature as well as NCBI database and is listed in S2 Table. In addition, measures of genome quality such as completeness, contamination, coarse consistency, and fine consistency as identified by EvalG and EvalCon tools of PATRIC database are also listed (S2 Table) The Indian strain S. Gallinarum VTCCBAA614 [32] is available from our culture collection at NCVTC, NRCE Hisar. All the sequences were extracted in fasta and GenBank (gbk) format.

Genome statistics and visualization
Genomic features such as number of CDS, tRNA, tmRNA, rRNA and repeat region (rr) were determined by running Prokka pipeline at UseGalaxy webinterface (

Pan-and core-genome calculations
Pan-genome is the complete set of orthologous genes (OGGs) harboured within a collection of investigated genomes, whereas, the core-genome refers to the set of genes present in all the genomes of a collection. On the other hand, dispensable genome refers to the set of genes harboured by one or a subset of investigated genomes.

Functional annotation of pan-and core-genome
The functional annotation of core-genome and complete set of strain-specific genes via orthology assignment was carried by eggNOG-mapper v2 online portal (http://eggnog-mapper. embl.de) [39,40]. The tool employs precomputed clusters and phylogenies from its in house database and provides orthology assignment to a large set of sequences via fast orthology mapping. Out of the total 29,430 CDS that comprised core-genome of nine investigated Salmonella genomes, 29,125 (98.96%) CDS were queried by eggNOG-mapper v2 for orthology mapping.
On the other side, 153 (48.11%) out of the total 318 strain specific CDS (singletons) were queried for orthology mapping.

Phylogenetic analysis of the strains
The identified core-genome of the investigated Salmonella strains was employed by EDGAR to decipher phylogenetic relationship among the strains. Alignment of each CDS set was obtained by MUSCLE [41], and then further joined to form one huge alignment. The FastTree software was used to construct the tree. Program employs Shimodaira-Hasegawa (SH) branch support values to verify the tree topology [42]. Conservation of gene order and genome rearrangements among Salmonella genomes were also explored by using EDGAR wherein S. Enteritidis P125109 was chosen as a reference to create synteny plots. The Indian strain, bvG VTCCBAA614, was not included, as the use of draft genomes is not recommended in synteny plot representation.

Detection of virulence factors
VFanalyzer tool available at VFDB (virulence factor database) was utilized to detect putative virulence factors in the investigated Salmonella genomes [43]. The first step in the detection of virulence factors by the tool involves construction of orthologous groups within the query genome sequence as well as in the pre-investigated Salmonella reference genomes archived in the database to avoid detection of paralogs. Next, it carries iterative and exhaustive sequence similarity search against the VFDB database for accurate detection of virulence factors. Finally, a context-based data refinement process is performed for virulence factors encoded by gene clusters [43]. We also carried out additional NCBI BLAST searches for the virulence genes that showed differential distribution after manual curation of VFanalyzer tool results.

Identification and in silico characterization of prophage sequences
Potential prophage sequences within the Salmonella genomes were detected and annotated by employing PHASTER (Phage Search Tool Enhanced Release). PHASTER, an improved version of PHAST phage search tools detects candidate prophage regions in the bacterial genomes and then classifies the identified regions into three classes i.e. intact, incomplete and questionable on the basis score obtained [47, 48].

Detection and analysis of Type II Toxin-Antitoxin (TA) gene cassettes
Type II TA loci were predicted in the Salmonella genomes by employing TAfinder [49]. The parameters used for the detection were BLAST e value: 0.01; HMMer E-value: 1; Maximum length for candidate toxin/antitoxin: 300; maximum overlap between candidate toxin and antitoxin: (-20-150).

Comparative genome statistics
Comparative genome analysis of selected nine strains of Salmonella species (  (Table 1). In addition, all the investigated Salmonella genomes were detected to possess only one tmRNA. Notably, only four rRNA sequence were detected in S. Gallinarum Sal40 strain VTCCBAA614 in comparison to possession of 23 rRNA sequences in S. Pullorum str. ATCC 9120 and presence of 22 rRNA sequences in rest of the genomes ( Table 1). The circular plot visualization of investigated genomes with reference to S. Enteritidis str. P125109 depicts varied GC content and GC skew along with predominant similarity of core-regions of the investigated genomes (Fig 1).

Pan and core-genome analysis unravels strain specific genes
The pan-genome calculation by customised set up in EDGAR led us to have discrete idea of the total genetic repository of the Salmonella genomes under study. The pan-genome of S. Gallinarum was found to be in open state with growth exponent value of 0.089 (95% confidence interval 0.084 to 0.093) (S1 Table). The pan-genome development plot showed steady growth with addition of each new genome and reached 5091 on addition of ninth genome which is nearly 1.1 times the average number of genes of nine strains (Fig 2A, S1 Table). On the other hand, core-genome development plot became limited to 3270 genes, with shared genes decreasing with new genome addition. The extrapolated core-genome size was 2684 (95% confidence interval 2507.66 to 2861.11) (Fig 2B, S1 Table). In addition, singleton development plot also indicated S. enterica serovar Gallinarum to be in an open pan-genome state as 43 new genes were predicted to be found at every genome addition (Fig 2C, S1 Table). The pan-genome of investigated Salmonella strains with the reference to S. Enteritidis str. P125109 strain is composed of 5,091 coding sequences (CDS) which include a core-genome of 3,270 (64.2%) CDS, a dispensable genome of 1254 (24.6%) CDS, and 567 (11.1%) singletons. The complete list of CDS detected as part of pan-genome, core-genome, and singletons with their function are listed in S3, S4 and S5A-S5I Tables. Notably, a total of 318 CDS in the range of 3-102 (Table 2, S5 Table) were found as strain specific genes aka singletons in the investigated genomes. Significantly, S. Gallinarum Sal40 strain VTCCBAA614 harboured the highest number (102) of singletons among the investigated Salmonella strains (Table 2, S5 Table). The singletons detected were identified to be as hypothetical proteins (61), transposase (13), membrane proteins (8), ATP binding proteins (3) among others. Interestingly, genomes of S. Enteritidis str. P125109 and S. Pullorum QJ-2D-Sal possessed 96 and 85 singletons each, respectively.
On the other hand, S. Pullorum str. ATCC 9120 and S. Pullorum str. S06004 genomes harboured ten singletons each, whereas, the rest of the genomes of possessed �5 singletons each ( Table 2, S5A-S5I Table). Functional annotation of core-genome and strain specific CDS by assignment of orthology (COG) was performed by eggNOG-mapper v2. The annotation by eggNOG-mapper provided detailed description of GO term, EC number, annotation level, COG category and description (S6 Table). 29,125 CDS (98.96%) out of the total 29,430 (core-genome sequences) were queried by eggNOG-mapper v2 for orthology mapping. The highest number among them belonged to the functional classes of function unknown (5923) transcription (2592), amino acid metabolism and transport (2196), energy production and conversion (2178), and cell wall/membrane/ envelop biogenesis (2133) (Fig 3A, S6A Table). On the other hand, 153 out of the 318 CDS (complete set of identified singleton sequences) were queried by eggNOG-mapper v2. The highest number amongst them belonged to the classes of function unknown (68), replication and repair (24) and defence mechanism (6) among others. Thirty five singletons were not provided any COG category by eggNOG-mapper ( Fig 3B, S6B Table)).

Phylogenetic and synteny plot analysis
A phylogenetic tree was constructed on the basis of core-genome comprising of 9,83,299 aa-residues/bp per genome and 88,49,691 in total of nine Salmonella genomes by EDGAR (Fig 4) using S. enterica serotype Enteritidis strain P125109 as a reference sequence to demonstrate the evolutionary relationship among the S. Gallinarum strains used in the study.SH support values for our tree were very good in general, with a minimum value of 0.772 and only two values below the maximum of 1.00. The bvG and bvP formed two distinct and strongly supported clades in the phylogeny (Fig 4). In concordance with their taxonomic classification bvG strains clustered together i.e. S. Gallinarum str. 287/91, S. Gallinarum str. 9184 and Gallinarum Sal40 strain VTCCBAA614) and formed one clade. Whereas, the rest of the bvP strains bifurcated into 2 clades (Fig 4).
The synteny plot analysis performed by EDGAR of the investigated Salmonella genomes in reference to S. Enteritidis str. P125109 depicted large scale genomic rearrangements which included relocations, inversions, duplications and deletions ( Fig 5). It was observed that all bvP strains genomes showed high degree of conservation of gene order among genomes. The highly similar genomic rearrangement within bvP comprised of inversion and duplication. However, the bvG strain 9184 was characterized by large region of inversion, which was not observed in bvG 287/91 in reference to S. Enteritidis str. P125109 strain. Synteny plot of individual investigated strains against the reference genome of S. Enteritidis str. P125109 is shown in S1 Fig.

Virulence factor detection
Putative virulence factors were detected by VFanalyzer tool in investigated Salmonella genomes. The tool identified candidate virulence factors related to capsule, fimbrial adherence, macrophage inducible genes, magnesium uptake, non-fimbrial adherence, regulation, secretion system, stress adaptation, autotransporter and invasion as listed (S7A-S7C Table). Fimbrial adherence genes which includes agf/csg, bcf, fim, lpf, peg, pef, saf, sef, sta, stb, stc, std, ste,  stf, stg, sth, sti, stj, stk and tcf were searched for in the investigated genomes. Notably, pef, sta, stc, stg, stj, stk and tcf operons were not observed in any of the analyzed genomes. Manual curation of the results obtained from the tool alongwith BLASTN searches revealed that among the fimbrial adherence determinants, chaperone-usher fimbrial gene operons of bcf, fim, lpf, peg, saf, stb, std, ste, stf, sth and sti were present in the S. Enteritidis str. P125109, however genes in the fimbrial operon sef were predicted to be missing in this strain (S7C Table).
On the other hand, among the investigated S. Gallinarum genomes, chaperone-usher fimbrial gene operons, bcf, fim, lpf, peg, saf, stb, ste, stf, sth and sti were present. Fimbrial operon sef and std operon were additionally predicted to be disrupted in the S. Gallinarum genomes, with stdA and stdB showing no significant similarity and stdC homolog showing divergent sequence (69.11% similarity) on BLASTN analysis (S7B Table). The operon of adhesin called thin curled fimbriae (curli), which is encoded by the agf /csg fimbrial gene cluster was detected in all the 9 genomes analysed (S7C Table).

Detection of Salmonella Pathogenic Islands (SPIs)
A total of 113 homologs of SPIs with an average of 14.77 SPIs were detected in the investigated Salmonella genomes by the combined usage of web tool SPIFinder 1.0and NCBI BLAST search. The integrated approach revealed presence of high sequence identity SP-1, SP-2, SP-3, SP-4, SPI-5, SPI-12, SPI-13, SP-14, C63PI and SPCS54 islands in their genomes (S8A and S8B Table). The detected SPIs varied in the range of 0.3 to 41.8 kb in terms of size. Interestingly, 3 SPI-13 (AY956834, AY956833, AY956832) and 2 SPI-14 (AY956835, AY956836) were observed in each of the investigated genome with the exception of S. Gallinarum str. 9184 genome, wherein SPI-14 (AY956836) (0.4 kb) was not detected. Moreover, SPI-12 was also not observed in S. Gallinarum Sal40 strain VTCCBAA614. Notably, various other SPIs and resistance islands reported in S. enterica serovar such as SESS LEE, SGI-1, SPI-10, SPI-11, SPI-2, SPI-6, SPI-7, and HP1 whose sequence were extracted from PAIDB database were not found to be present by both the approaches. The features of detected SPI homologs in their respective genomes such as starting position, end position, size, and external annotation if any are detailed in (S8A and S8B Table).

Identification and analysis of prophage and prophage remnant regions
Each of the analyzed Salmonella genome was detected to possess at least one candidate prophage region by PHASTER (S2 Fig, S9 Table). In total, 23 prophage regions were identified in which fifteen were classified as putative intact regions and eight as incomplete prophage regions by PHASTER (S2 Fig, S9A Table). The detailed information of the identified prophage regions i.e., region name, region length, completeness, score, total protein, region position and GC percentage in the respective Salmonella genomes as determined by the web server are listed in S9A Table. The highest number of prophage region were detected in S. Pullorum QJ-2D-Sal and S. Pullorum str. S06004 with each possessing three complete prophages and one incomplete prophage region. In total, there were 5 different propahge elements with Gifsy_2 being the most common, which was detected in all genomes analysed. On the other hand, all 3 bvG possessed only a single (Gifsy_2) intact prophage element. Whereas bvP harboured a variety of phage elements including Gifsy_2. The average size of identified prophage regions was 36.31 kb in which largest (64.3 kb) and smallest (8.5 kb) candidate prophage regions were observed in S. Enteritidis str. P125109. On the other hand, the average size of identified complete prophage sequences in the analyzed Salmonella genomes was found to be 44.98 kb with lowest size of 29.2 kb belonging to S. Pullorum str. ATCC 9120. (S9A and S9B Table). Whereas, largest intact prophage region of 64.3 kb was observed in S. Enteritidis str. P125109 genome. The GC percentage in identified prophage region varied from 45.59% to 53.21% (S9A Table). On the other hand GC% of complete prophage regions varied from 47.23% to 53.21% (S9B Table).

Detection of acquired antibiotic resistance genes/chromosomal point mutations
Interestingly, all the analyzed Salmonella genomes harboured acquired aminoglycoside resistance gene (aac(6')-Iaa) (S3 Fig, S11 Table) as determined by Resfinder. The information related to identity, contig and position of the identified aminoglycoside gene in their respective genomes is listed in S11 Table. On the other hand, no acquired resistance gene for beta-lactam, colistin, fluoroquinolone, fosfomycin, fusidic acid, glycopeptide, MLS-macrolide, lincosamide and streptogramin B nitroimidazole, oxazolidinone, phenicol, rifampicin, sulphonamide, tetracycline, and trimethoprim was detected in the analyzed Salmonella genomes by Resfinder. Strikingly, known chromosomal point mutation in gyrA was detected in 2 bvG and 2 bvP genomes out of the nine analysed i.e., S. Gallinarum str. 287/91, S. Gallinarum Sal40 strain VTCCBAA614, S. Pullorum str. S06004 and S. Pullorum QJ-2D-Sal (S11 Table). In addition, unknown mutations were detected in 16S_rrsD gene of all the analysed genomes (S11 Table). Moreover, unknown mutations in parE were detected by Resfinder in S. Gallinarum Sal40 strain VTCCBAA614 and S. Pullorum str. S06004. A total of six antimicrobial resistance genes (acquired, known and unknown chromosomal point mutations) were identified among the nine investigated strains.

Discussion
Salmonella enterica subsp. enterica comprises of both host-adapted and host-promiscuous pathotypes that causes a spectrum of diseases depending on the serovar or host [51, 52]. Pullorum Disease (PD) and Fowl Typhoid (FT) caused by S. Gallinarum biovars Pullorum (bvP) and Gallinarum (bvG) respectively are endemic in countries of Asia and South America leading to huge economic losses to poultry industry [1,5,53]. It is interesting to note that as compared to FT, the PD reports have been low in India [54]. On the other hand, PD is frequent in China, being prevalent in every province [5].  genomes and gain insights into pan genome, pathogenesis, mobiliome, resistome, and taxonomy.
For a comprehensive analysis in an open pan-genome, determination of minimum number of necessary strains is difficult [27]. The present analysis entails sufficient genotypic variability, as our selected eight genomes represent a single serovar Gallinarum of Salmonella enterica ssp. enterica, and the total SNP count in each genome varied from 7424 to 12,726 ( Table 1). The pan-genome of investigated strains of S. Gallinarum comprises of 5091 CDS with a coregenome of 3270 CDS and an indispensible genome of 1254 CDS (Table 2). With a R cp of 64%, the genomes depict a fairly high degree of genomic diversity and asymmetry [59].
The pan-genome and core-genome development plot analysis in the present investigation revealed S. Gallinarum to possess an open pan-genome state (Fig 2A and 2B). A steady growth with addition of new genomes is depicted in the pan-genome development plot, which indicates the capacity of this sympatric species to rapidly acquire exogenous DNA [60]. The core development plot becomes limited to about 2600 genes (Fig 2B). The acquisition of exogenous DNA poses obvious challenge for control strategies against FT/PD, however core-genome based antigen selection also offers insight for vaccine candidates. It is important to examine these core genes for reverse vaccinology application as their presence in all strains and high degree of conservation makes them effective source of potentially universal antigens [61]. Although the dispensable genes are strain-restricted, they can also be explored and exploited for immune-antigens [61] (Fig 3, S6B Table). Interestingly, strain specific genes were also detected in all the investigated genomes in the range of 3-102 (S5 Table). Furthermore, singleton development plot demonstrated the possibility of finding 43 new genes with each newly sequenced genome (Fig 2C). Indian strain S. Gallinarum Sal40 strain VTCCBAA614 harboured the highest number of singletons i.e. 102 whilst S. Enteritidis str. P125109 possessed 96 singletons (S5 Table). Whether the presence of large number of singletons in S. Gallinarum Sal40 strain VTCCBAA614 and S. Pullorum QJ-2D-Sal, in comparison to other investigated strains is related to higher acquisition of foreign genes requires to be investigated.
Lower core-genome to pan-genome ratio and presence of large number of singletons in S. Enteritidis str. P125109 and other investigated strains suggests the dynamic state of S. Gallinarum genome. Salmonella Gallinarum has undergone extensive degradation and also acquired specific genes related to virulence, non-virulence, metabolism, information storage by horizontal gene transfer which may contribute to its avian host specificity [10,27,62].
Studies on very large number of Salmonella enterica genomes such as one employing 4893 genomes detected a pan-genome of 25.3 Mbp, a strict core of 1.5 Mbp present in all genomes, and a conserved core of 3.2 Mbp found in at least 96% of these genomes [63]. This makes for a strong case for creating a database of serovar specific genome components such as for S. Gallinarum, a host specific pathogen.
Following core-genome identification, its functional analysis by employment of eggNOGmapper v2revealed that large number of the CDS (20%) belonged to function unknown category (5923), which indicates that our understanding of genetic repertoire of Salmonella genome is limited. It is probable that a sizeable number of such CDS may be mutant phenotypes for protein-coding genes [64]. Otherwise the maximum genes were related to transcription, cell-wall and membrane biogenesis, energy metabolism, amino acid metabolism and transportation among others (Fig 3, S6A Table). The identified core-genome is also dominated by ATP-binding cassette (ABC) exporters which could be explored as novel drug targets [65]. On the other hand, the functional annotation of complete set of strain specific CDS of investigated strains fell in the categories of function unknown, replication and repair, and defence mechanism among others (Fig 3, S6B Table).
The core-genome based phylogeny (Fig 4) of S. Gallinarum biovars. from different geographical origins has clearly divided strains into divergent clades, and Indian strain was grouped externally in the S. Gallinarum clade, as in previous study [1]. Similarly, the phylogenetic relationship between S. Gallinarum biovars and S. Enteritidis has been genomically analysed, which highlighted the shared ancestry of bvP and bvG with S. Enteritidis descendent originating in a 'second' clade [8]. In a recent study, the core-genome SNP phylogeny of only bvP strains from China has divided it into four lineages [5], which suggests that the study of within biovar genomic epidemiology of Indian bvG strains will similarly need inclusion of more number of local strains. The genomic diversity within the investigated S. Gallinarum strains is supported by large scale genomic rearrangement events including inversions, deletions, relocations and duplications detected in synteny analysis (Fig 5, S1 Fig).
The evolution of host-adapted serovars of salmonellae has traversed a path of periodic gene acquisition and gene disruption as a mechanism of niche adaptation away from enteric location to systemic one [8]. These genetic elements contribute to the bacterial virulence, pathogenesis, fitness in specific niche, and evolution [8,12]. Putative virulence factors in the investigated Salmonella genomes related to fimbrial adherence and secretion system among others were searched and analysed (S7A-S7C Table).
The fimbrial antigens play an important role in adherence of bacteria to the cell surface, which is essential to the pathogenesis of the disease, leading to bacterial invasion [66]. Although a number of fimbriae have been described in Salmonella, the chaperone-usher pathway represent the major fimbriae encoded by Salmonella [67][68][69][70][71][72].
The degradation in fimbrial operon in S. Gallinarum has been associated with departure from intestinal colonization [5]. In our investigation, the fimbrial operons Agf/Csg, bcf, lpf, peg, saf, stb, ste, stf, sth and sti were observed to be present in all of the S. Gallinarum genomes. Suez et al, (2013) [73] performed Comparative Genomic Hybridization (CGH) on 5656 Salmonella ORF representing 12 invasive Salmonella serovars including serovar Dublin and Cholerasuis, 2 host-associated serovars, and detected five fimbrial clusters (bcf, csg, stb, sth and sti) to be part of core genome for invasion and systemic disease in humans. In our study, among the S. Gallinarum genomes, chaperone-usher fimbrial gene operons, bcf, lpf, peg, saf, stb, ste, stf, sth and sti were detected to be part of the core genome. Notably, in the S. Gallinarum genomes studied, operon fim, sef, and std were predicted to be disrupted (S7C Table).
Hu et al, (2019) [5], reported std along with saf and csg to be intact in almost all the S. Pullorum strains in their study. On the other hand, we detected differential distribution of std operon in our study. The std operon, which was detected in S. Enteritidis str. P125109 was not detected in any of the 3 bvG, and 2 of the 5 bvP genomes studied (i.e., S. Gallinarum/pullorum str. CDC1983-67 and S. Gallinarum/pullorum str. RKS5078). In addition to std operon, the sef operon, sefA and D genes were not detected in S. Enteritidis str. P125109 and all other investigated genomes.SEF14 plays a role in the colonization of Peyer's patches and in the adhesion and invasion of intestine epithelial cells [74][75][76].
The lpf, fim, and agf /csg operons, which were detected in all of our investigated genomes, encode fimbrial proteins which mediate attachment of S. enterica serotype Typhimurium in epithelial cell lines in vitro for long-term intestinal persistence [77]. The gene csgD of agf /csg operon activates the production of curli fimbriae by transcriptional activation of the csgBAC operon encoding the structural genes of curli fimbriae [78]. The csgA gene of agf /csgoperon has also been evidenced to be involved in biofilm formation in S. Pullorum, which may contribute to the virulence [79].
The type 1 fimbriae (T1F) are important virulence factors in Enterobactericeae including Salmonella spp., and is commonly expressed in virulent strains of Salmonella spp. [80,81]. In our study, the fim operon was detected in all genomes except absence of fimA and I genes in S.
In all the genomes studied, pegABCD operon was detected to be the part of core genome. The peg fimbrial gene cluster originally discovered in S. enterica subsp. enterica serovars Paratyphi A, Enteritidis, and Gallinarum [10], is considered to influence caecal colonisation of chickens by S. Enteritidis [83]. However, Hu et al, 2019 [5], found the peg operon, disrupted in all strains of S. Pullorum, suggesting their redundancy.
We next focussed our investigation on identification and analysis of candidate virulence factors that were previously not identified in the genomes of the analyzed strains which included genomic islands, prophages, TA cassettes and AMR genes.
The acquisition of virulence capability enhancement en masse by acquisition of SPI has been hall mark of S. Gallinarum [17,88]. A total of 113 SPI homologs with nearly 15 SPIs per genome were detected by combinatorial usage of SPIfinder and BLAST searching (S8A and S8B Table), which uncovered the presence of homologs of SP-1, SP-2, SP-3, SP-4, SPI-5, SPI-12, SPI-13, SP-14, C63PI and SPCS54 island in the investigated genomes (S8A and S8B Table). The presence of particular SPIs in each of the genomes indicates their specificity towards S. Gallinarum and S. Enteritidis and thus corroborates with previously published reports of SPIs showing serovar specificity [44, 89,90].Notably, the role of SPI's other than SP1 and SP2 in virulence and pathology of S. Gallinarum infections has not been analyzed in detail. The elucidation of functional role of other SPIs thus requires further experimental investigations with functional knockout mutations or via targeting and deletion of SPIs by utilization of CRISPR-Cas systems [91][92][93][94]. The limitation of manually screening mutant phenotypes in functional knockout mutation studies has led to use of high-thoroughput sequencing technologies in conjunction with transposon mutagenesis i.e., transposon sequencing (Tn-seq) [95,96]. Transposon-directed insertion site sequencing (TraDIS) was used to study the mechanisms used by Salmonella to enter and persist within the bovine lymphatic system in comparison to intestinal colonization [97]. Several variations of Tn-seq have been devised which mainly differ in transposon junction sequence amplification techniques, which can be applied to obtain global gene functional information in Salmonella.
Prophages contribute to pathogenicity, bacterial fitness, diversity, resistance to phages, antimicrobial resistance, evolution as well as aid in increased environmental tolerance [20, 100,101].
In our study, the intact prophages were detected in the range of 1-3 with the highest number in bvP QJ-2D-Sal and bvP str. S06004. As in our study, a recent study also detected only a single phage in all of their bvG genomes, which is less than 5±3 prophage regions per genome detected in 1,760 S. enterica genomes, using Phaster [102]. However the observed prophage diversity in our study is in contrast with lower diversity reported by [103], using 2 strains of S. Gallinarum. The gifsy_2 prophage element which was detected in all genomes analysed in this study was earlier detected predominantly in S. Typhimurium genome [102]. Significantly prophage element ST104 detected in bvP QJ-2D-Sal has been previously reported in MDR S. Typhimurium DT104 as phage ST104. Notably, prophage and plasmids acquisition has been demonstrated in S. Pullorum leading to emergence of multi-drug resistant (MDR) strains [5,104]. Our study also highlights variable number of propahges detected in bvP as compared to single prophage element detected in bvG. Hu et al, 2019 [5] also detected multiple prophage elements in their bvP genomes, which leads to high genetic diversity besides AMR. The differential presence of prophages in the current study indicates the ability of S. Gallinarum genomes to acquire new gene content, and introduce diversity in them which is in accordance with the singleton development plot that depicted the finding of 43 new genes with each newly sequenced genome. This is further refurbished by detection of holin, phage-like protein, phage membrane protein, side tail fibre protein genes among 104 singletons detected in S. Gallinarum Sal40 strain by EDGAR (S4I Table)). The differential distribution of prophages in the current study can be utilized as markers for strain characterization, diversity assessment, tracking of strains, determination of transmission history of S. enterica Gallinarum strains as phage typing has been widely employed by various investigators for these goals in various bacterial species [21, [105][106][107][108][109]. Moreover, cryptic prophages identified in the study can be utilized as drug target after determination of their role and essentiality status in the genome as they are permanent resident of microbial genomes [110].
In order to elucidate the role of prophages of S. Gallinarum, prophage induction may be achieved using the SOS response activated by chemicals such as mitomycin C, hydrogen peroxide and UV radiation, which cause DNA damage, apart from instances of spontaneous induction, and inflammation [111][112][113]. Induced prophages can be further studied by genomic sequencing and bioinformatic methods [112]. Further, specific prophage methylation pattern and repertoire of methyltransferase motif estimation can give useful insights on prophage gene expression [114]. Nothwithstanding the widespread repression of prophage genes, transcriptomics data analysis of propahge regions can be an important way forward [112,115].
Toxin-antitoxin operons which are implicated in pathogenicity, virulence, persistence, stress endurance, and antibiotic resistance in bacteria and archae were searched for in the investigated Salmonella genomes [116][117][118]. The computational search revealed a total of 149 TA cassettes in the range of 16-18 in the investigated S. Gallinarum genomes (S10 Table). Significantly, a majority of the toxins of identified TA systems belonged to relElike family and highest number of identified antitoxins possessed RHH like domain (S10 Table, Fig 6). The presence of 149 Toxin antitoxin operons in the range of 16-18 in each of the analyzed S. Gallinarum genomes indicates their possible role in growth regulation, niche adaptation, encountering of various stress responses inside macrophage as well as in persistence, as has been observed in various bacteria and archaea including S. Typhimurium LT2, [24,117,119]. Moreover, the biological significance of presence of large number of TA's may be due to their potential involvement in number of underlying regulatory pathways, as can be inferred by identification of different domains such as GNATlike_domain (24 nos.), MazFlike_domain (9 nos.), COG2929 like_domain (9 nos.), yeeU (8 nos.), doc (5 nos.), and PINlike_domain (3 nos.) in the structure of the toxins detected in the study. These TA domains have been previously reported to be associated with distinct features and different functional roles [117,120]. As the acquisition of TA operons can be either through horizontal gene transfer or gene amplification [121], therefore the diversity of TA operons observed in Salmonella genomes reemphasizes the flexible nature of S. Gallinarum genome. Elucidation of their role requires further experimental analysis to be carried out.
The in silico identified TA systems also requires to be experimentally characterized before considering them as bonafide TA operons by designing assays to test whether toxicity of overexpressed toxin protein of TA pair is abrogated by over-expression of putative antitoxin protein in heterologous systems [122][123][124]. Moreover, experimental investigation is required to decode their role in pathogenesis. In addition, to knock out studies a number of powerful genetic tools can be employed to elucidate the function of detected virulence factors. Ectopic expression of TA systems can be studied to decipher their role in growth regulation [125,126]. In addition, site directed mutagenesis can be utilized for detection of residues crucial for toxin activity for functional characterization of TA systems [127,128]. Moreover, expression profile of identified TA modules in response to various stresses at multiple time points can be detected by using whole genome microarray platform comprising of TA loci probes and that could lead to co-expressed genes/pathways [129]. The Salmonella TA systems can be considered as potent targets for structure based inhibitor design as has been previously employed against various other bacteria [130,131].
Although plasmids are important carriers of AMR genes in Salmonella serovars, [132], we detected the presence of acquired aminoglycoside resistance gene (aac(6')-Iaa) in all the investigated Salmonella genomes (S3 Fig, S11A Table). Neuert et al., 2018 [133] found in their study on Non-typhoidal Salmonella (NTS), that nearly all but eight of the total 3,491 isolates carried an aminoglycoside acetyltransferase aac(6 0 )-type gene. However, only eleven showed phenotypic resistance to an aminoglycoside antimicrobial. In another study, the genes associated with ant and aph were found in all aminoglycoside-resistant Salmonella isolates, but aac genes were only found in Salmonella ser. Typhimurium and Salmonella ser. Heidelberg from chicken-sourced isolates [134]. The observation is in concordance with previously published reports stating that aminoglycosides exhibit weak bactericidal activity against intracellular S. enterica serovars [135]. Strikingly, known mutation in gyrA were also observed in four genomes out of the nine investigated i.e., S. Gallinarum str. 287/91, S. Gallinarum Sal40 strain VTCCBAA614, S. Pullorum str. S06004 and S. Pullorum QJ-2D-Sal (S11B Table). Koerich et al (2018) [58] also reported high resistance to drugs from macrolide and quinolone groups in Salmonella Gallinarum field isolates. Literature mining has revealed that mutations in gyrA are associated with resistance to fluoroquinolones in eight species of Enterobacteriaceae [136]. Chromosonal point mutations in rrsD gene in the helix 34 region, which results in changes to the drug-binding site in the ribosomal 16S rRNA has been reported to confer resistance against aminoglycoside spectinomycin of Salmonella enterica serovar Typhimurium gene [137,138]. Workers have also reported mutations in rrsD genes in 2 S. enterica isolates from cattle and chicken in South Africa [139].
Antimicrobial resistance in Salmonella serovars is a serious poultry husbandry and public health problem all over the world including Asian, and South American region [58, 140,141]. The observation of variable Antimicrobial resistant genes in the study can be utilized as AMR markers. The presence of additional AMR genes in S. enterica Gallinarum str. Sal40 and S. enterica Pullorum str. S06004 indicates the ability of acquiring exogenous DNA for adaptation in response to environmental requirements. The differential presence underlines the need of NGS based mapping of AMR genes in S. Gallinarum serovars to better manage their control measures [142].
A number of investigators have previously carried out comparative genome analysis of S. Gallinarum with several other serovars which includes S. Enteritidis, S. Typhimurium to decode the genomic differences amongst them. The comparative genome analysis of S. Enteritidis PT4, S. Typhimurium LT2 and S. Gallinarum 287/91 was carried by Thomson et al., (2008) [10] wherein they reported predominant similarity and synteny in their core genomes and overrepresentation of putative pseudogenes in S. Gallinarum 287/91 in comparison to S.
Enteritidis str. P125109. Similarly, a genomic comparison between S. Gallinarum and S. Pullorum was carried out by Feng and associates [143] and they included S. Gallinarum str. CDC1983-67, S. Gallinarum str. RKS5078, S. Gallinarum str. 287/91, and S. Enteritidis str. P125109 and reported a high number and differential distribution of pseudogenes in bvG and bvP strains in reference to S. Enteritidis strain. On the other hand, genome comparison between closely related S. enterica serovars Enteritidis, Dublin, and Gallinarum revealed differential distribution of prophages and pseudogenes in their genomes along besides SNPs, insertions, and deletions amongst the investigated strains [103]. In a similar manner, to decode different host-specificities, genes related to SPI of S. Enteritidis PT4 (NCTC 13349) and S. Gallinarum 287/91 NCTC 13346 were compared by Eswarappa et al. 2009 [144] and detected 24 positively selected genes that included SPI-2 TTSS and effector proteins of SPI-1 TTSS.
The present study suggests that the host restricted S. Gallinarum strains harbour strain specific genes and they exhibit differential distribution of putative virulence factors such as genomic islands, prophage regions, TA cassettes, and acquired AMR genes in their genomes. This study also highlights the need to analyse more number of S. Gallinarum genomes to understand the phylogeny and better capture of its pan-genome biodiversity. This comparative genomic analysis of S. Gallinarum serovar has provided a valuable insight and laid foundation for future experimental studies to be carried out to decipher the underlying mechanisms driving the pathogenesis and virulence of this avian restricted pathogen.

Conclusion
Salmonella enterica serovar Gallinarum biovar Pullorum (bvP) and biovar Gallinarum (bvG), are the causative agents of disease (PD) and fowl typhoid (FT) respectively, which causes considerable economic losses to poultry industry worldwide especially in developing countries including India. Comprehensive comparative genome analysis of eight S. enterica serovar Gallinarum strains originating from different geographical regions including Indian strain S. Gallinarum Sal40 VTCCBAA614 was carried out to decode the genotypic differences amongst them with a focus on detection and analysis of candidate virulence factors. The investigation revealed an open pangenome for S. Gallinarum encompassing 5091 coding sequence (CDS) with 3270 CDS belonging to core-genome, 1254 CDS to dispensable genome and strain specific genes amongst the analyzed strains. Furthermore, analysis of distribution of candidate virulence factors in the investigated strains revealed diversity and differential distribution of genomic features such as genomic islands, prophage regions, toxin-antitoxin operons, and acquired antimicrobial resistance genes. The fimbrial operons Agf/Csg, bcf, lpf, peg, saf, stb, ste, stf, sth and sti were observed to be present in all of the S. Gallinarum genomes, whereas operon sef, and std showed differential distribution. The computational search unravelled the existence of high sequence identity genomic islands SP-1, SP-2, SP-3, SP-4, SPI-5, SPI-12, SPI-13, SP-14, C63PI and SPCS54 in their genomes. Additionally, 23 prophage regions and 149 Type II TA loci were also identified and characterized in the investigated genomes. The genomic variability detected among the S. enterica serovar Gallinarum strains will form the basis for future experimental investigations and could be used for bacterial typing. The core genome assignment can be utilised towards design of diagnostics, as well as drug and vaccine target prediction for effective surveillance, prevention, and control.