Genetic diversity of Francisella tularensis subsp. holarctica in Kazakhstan

Tularemia is a highly dangerous zoonotic infection due to the bacteria Francisella tularensis. Low genetic diversity promoted the use of polymorphic tandem repeats (MLVA) as first-line assay for genetic description. Whole genome sequencing (WGS) is becoming increasingly accessible, opening the perspective of a time when WGS might become the universal genotyping assay. The main goal of this study was to describe F. tularensis strains circulating in Kazakhstan based on WGS data and develop a MLVA assay compatible with in vitro and in silico analysis. In vitro MLVA genotyping and WGS were performed for the vaccine strain and for 38 strains isolated in Kazakhstan from natural water bodies, ticks, rodents, carnivores, and from one migratory bird, an Isabellina wheatear captured in a rodent burrow. The two genotyping approaches were congruent and allowed to attribute all strains to two F. tularensis holarctica lineages, B.4 and B.12. The seven tandem repeats polymorphic in the investigated strain collection could be typed in a single multiplex PCR assay. Identical MLVA genotypes were produced by in vitro and in silico analysis, demonstrating full compatibility between the two approaches. The strains from Kazakhstan were compared to all publicly available WGS data of worldwide origin by whole genome SNP (wgSNP) analysis. Genotypes differing at a single SNP position were collected within a time interval of more than fifty years, from locations separated from each other by more than one thousand kilometers, supporting a role for migratory birds in the worldwide spread of the bacteria.


Introduction
Precise knowledge of the genetic diversity of microorganisms is useful for multiple purposes, including understanding of the emergence of pathogens, tracking and controlling their spread, as well as certifying and patenting strains of interest for biotechnologies [1][2][3]. DNA "fingerprinting" of bacteria began to develop with the understanding of the role of this molecule and improved throughout the development of DNA technologies. Genotyping based on DNA fragmentation and resolution by Pulsed-Field Gel Electrophoresis (PFGE), Random Amplified Polymorphic DNA (RAPD), Restriction Fragment Length Polymorphism (RFLP) have been replaced by simpler, more accurate and reproducible (numeric) methods including Multiple-Locus Variable-number of tandem-repeats (VNTR) Analysis (MLVA), MultiLocus Sequence Typing (MLST) and selected single nucleotide polymorphism (canSNP) typing [4][5][6]. Currently, whole-genome sequencing (WGS) of bacterial strains, the ultimate genotyping method, is becoming increasingly accessible but is still too costly especially in developing countries to allow its use as a first-line assay. In the ongoing transition period, highly discriminatory and cheap DNA based genotyping methods can be used as a first-line method for selecting strains for WGS, especially in highly endemic regions and as a low-cost assay for routine quality control of strain collections [7].
Most laboratories are using one method or more among WGS, MLST, canSNP, MLVA and/or PFGE in parallel for a number of pathogens in order to carry out epidemiological control adapted to local context, capacities, and needs [8]. In this regard, the development of in silico genotyping methodology based on WGS data and allowing backward compatibility is particularly useful for pathogens for which MLVA and/or MLST were considered as the "golden standard" of genotyping such as Brucella spp, Bacillus anthracis, Yersinia pestis, F. tularensis or Neisseria meningitidis [9][10][11][12]. Carrying out in silico MLST on WGS data is not difficult owing to the availability of appropriate algorithms and allows comparing the results obtained previously [13]. Obtaining a relevant in silico MLVA profile is a more challenging task due to the occurrence of errors in the assembly of short reads and contig breaks in VNTR locations [14]. Nevertheless, the use of appropriate sequencing reads length allows to conduct in silico MLVA with high reliability. It has been established for Brucella that read sizes of 200 base-pairs or more allow to correctly assemble tandem repeat arrays for most loci used in the MLVA assay and to produce a reliable MLVA profile [15]. Simple scripts have been developed to run in silico MLVA (https://github.com/i2bc/MLVA_finder). FASTA files of assemblies (complete genomes or contigs) are commonly used as input. However, currently available assemblers often use k-mer values shorter than read length, which can be detrimental when assembling tandem repeat arrays and does not allow to take full advantage of reads length [16].
In silico MLVA analysis was not previously performed for F. tularensis, although it is considered to be among the six most dangerous pathogens with potential for use as an agent of bioterrorism [17] and MLVA typing has been shown to be relevant [4,18]. Twenty-five VNTRs have been described so far in the F. tularensis genome and various schemes using a subset of these have been proposed. MLVA correctly classifies F. tularensis into the three subspecies, of which the most pathogenic for humans is F. tularensis subsp. tularensis common in North America with a single strain of unclear origin described in Europe [18,19]. F. tularensis subsp. holarctica distributed in most of the northern hemisphere is less pathogenic for humans [20,21]. F. tularensis subsp. mediasiatica present in Central Asian region and southern Siberia has not been associated with human infections [22]. F. novicida was once proposed as subspecies based on its high genetic similarity (97%) with F. tularensis but this assignment is controversial and confusing in terms of ecology, evolution, and metabolism [23,24]. F. novicida detected in North America and Australia is weakly virulent and can cause disease in people with immunosuppression [23,25]. The evolution of F. tularensis is clonal [26] and the species constitute a monophyletic lineage in contrast to F. novicida which is a clear outgroup with a very different behavior [24]. The geographic origin and the date of emergence of F. tularensis are unknown, and the population structure is reflecting a capacity to travel long distances, as well as a very low mutation rate [26]. The clonal evolution of F. tularensis allows to use selected single nucleotide polymorphisms commonly called canSNPs [27] [5]. Currently available data on the B.6 holarctica sublineage in Western Europe is suggesting that Western Europe was contaminated from Eastern Europe [26,28,29]. The most basal B.16 lineage has been reported in Japan, China [30,31] and Turkey [32]. This feature combined with the geographic distribution of mediasiatica which was found only in Central Asia [22] is supporting an Asian or Central Asian origin for F. tularensis [30].
Due to the endemicity of tularemia in Kazakhstan, extensive field work has been done during the past seventy years. This work has allowed to map a number of natural foci accross the country [33]. Natural reservoirs of tularemia are registered in twelve among the 14 administrative regions of Kazakhstan and the total area of the reservoirs is about a fifth of the country's territory (552 thousand square kilometers). During the past 90 years (1928-2018) about 10.000 human cases of tularemia were registered in Kazakhstan. The majority of cases occured in the 1950s and were caused by labor migration of unvaccinated populations to endemic areas [34]. In the 1928-1970s period, the majority of cases occured as outbreaks involving from 105 to 1791 cases. Improving sanitary conditions and routine vaccination reduced the tularemia incidence to sporadic cases [35,36]. Nevertheless, F. tularensis is present in East Kazakhstan, Akmola, West Kazakhstan, Aktobe, North Kazakhstan and Pavlodar regions, as shown by periodic epizootics in rodents [37]. Strains of F. tularensis subsp. holarctica are routinely isolated from at least 26 vertebrate species including humans, rodents, hares and birds as well as invertebrate vectors (blood-sucking insects and arthropods) from natural habitats across Kazakhstan. Also, strains of F. tularensis subsp. mediasiatica are isolated in the Almaty and Zhambyl regions from rodents, hares and invertebrate vectors.
There is however currently little information about the genetic diversity of circulating strains. The present report is a contribution in order to fill this gap. We applied whole genome sequencing as well as in vitro MLVA genotyping on a relevant subset of 39 strains isolated in Kazakhstan. We compared the in vitro and in silico MLVA data deduced from WGS data, as well as the clustering achieved by MLVA and whole genome SNP (wgSNP) analysis. A panel of seven VNTR loci which can be amplified in a single PCR reaction and is well adapted to the genetic diversity observed in Kazakhstan was defined in order to provide a first-line genotyping assay usable on a large scale. In conclusion, we describe the genetic diversity of F. tularensis subsp. holarctica determined by MLVA, canSNP and wgSNP methods and compare with currently available WGS data of worldwide origins. This is the first report of WGS data for holarctica strains from Kazakhstan.

Strain and DNA extraction
A total of 39 strains of F. tularensis subsp. holarctica were obtained from the Kazakhstan "Republican collection of microorganisms and the depository of pathogens of especially dangerous infections" of the Republican State Enterprise on the Right of Economic Management "National Scientific Center for Especially Dangerous Infections named by Masgut Aykimbayev" of the Ministry of Healthcare of the Republic of Kazakhstan (NSCEDI). Thirty-eight strains were isolated from rodents (14 strains), ticks (19 strains), carnivores (two strains), one migratory bird and two water samples from eight regions of Kazakhstan in the period from 1951 to 2018. The vaccine strain F. tularensis 15 NIIEG (alias LVS) originating from Kazakhstan was included as a reference strain ( Table 1). The strains were isolated during routine monitoring or during periods of epizootological outbreaks, in accordance with applicable laws. All vertebrate hosts but one were trapped, and tick were host-seeking, pasture ticks collected from the vegetation. Work with cultures of tularemia pathogens was carried out in accordance with current regulatory and methodological documents on the organization and conduct of laboratory diagnosis of tularemia in Kazakhstan.
The lyophilized strains were plated on transparent agar nutrient medium for cultivation and isolation of tularemia microbe with vitamins and mineral additives ("FT-agar", FBIS SRCAMB, Obolensk, Russia) and cultured for 48 hours at a temperature of 37 ± 1˚C. A bacterial suspension was prepared and inactivated by the addition of merthiolate sodium to a final concentration of 0.01%, heated for 30 min at a temperature of 56˚C. DNA was isolated from inactivated suspension with QIAamp DNA Mini Kit (Qiagen, Germany).
SNPs were called by mapping raw sequencing reads on a reference genome using BioNumerics version 7.6.3 (Applied-Maths, Laethem-Saint-Martin, Belgium) with default parameters as previously described except when mapping Ion Torrent PGM sequence data for which the "allow gapped alignment" option was activated [22]. Publicly available assemblies and raw reads were downloaded via EBI-ENA (last download on August 25 th 2020). The list of public assemblies and sequence reads archives evaluated for this report is provided in S1 Table, as well as the list of accessions used for each figure. Assemblies were split into 50 bp long artificial reads with a 10X cover and no error model. These artificial reads were then mapped on the reference genome for SNP calling as done with raw sequencing reads. BioNumerics was used with default parameters (no resampling) for Maximum Parsimony analysis [40] and dendrogram drawing. Raw reads datasets were assembled using SPAdes [41] version 3.13.1 and SKESA [42] version 2.3.0. The canSNP assignment of each strain was deduced from genome assemblies using CanSNPer2 (https://github.com/FOI-Bioinformatics/CanSNPer2) [43] and the available F. tularensis canSNPer2 scheme [18,28,43,44].

In silico MLVA on genome assemblies and on raw sequence reads
In silico MLVA was run on genome assemblies by applying the MLVA_finder.py script with default settings (https://github.com/i2bc/MLVA_finder) and primers listed in S2 Table. The size of the polymorphic VNTR arrays being smaller than the 300 bp reads produced here, we also typed raw reads data. A simple script based on bbduk (BBTools, https:// sourceforge.net/projects/bbmap/) was developed to identify reads containing VNTRs and both flanking sequences (https://github.com/Vladislav-Shevtsov/search-primers-in-reads/). Primers were selected that were located as close as possible to the VNTR loci (S2 Table). When selecting primers for in silico instead of in vitro detection, attention was paid only to uniqueness and to the formation of only the target fragment. The script combines all reads containing a VNTR locus of interest in a single multifasta file on which the MLVA_finder.py script can subsequently be applied.

Whole genome SNP analysis
The WGS data obtained from the 39 strains was used for wgSNP sequence phylogenetic analysis together with representative publicly available WGS data (S1 Table).  All the other strains belong to holarctica clade B.12. Sixteen belong to subclade B.12_B.23 and seventeen to subclade B.12_B.42 [47]. Fig 2 illustrates that identical whole genome SNP genotypes are observed in strains collected in distant locations at different times. Strain Tul-92_KZ collected in West Kazakhstan in 1988 is identical to Tul-97_KZ collected in Pavlodar in 2013, i.e. twenty-five years later more than one thousand kilometers away. A single SNP separates Tul-13_KZ collected in 2012 in West-Kazakhstan from Tul-71_KZ collected 49 years earlier in the Almaty region, or Tul-52_KZ isolated in 2013 in Aktobe from Tul-135_KZ isolated sixty years earlier in the Almaty region. In both instances, the most recent strain showed the ancestral genotype, and the single SNP in the derived genotype is a G to A transition, in agreement with the previously observed AT mutation bias [26]. This indicates that identical wgSNP profile can be perpetuated for twenty-five years and strengthens previous observation of identical wgSNP genotypes in strains collected in Sweden 15 years apart [48] and in France, Switzerland and Germany up to 31 years apart [28]. Some sub-branches show a strong geographic  Table).
In contrast, eleven strains coming from six regions within Kazakhstan constitute a polytomy with three branches radiating from the blue star within the B.42 sublineage (Fig 2). Similarly branch L3 is represented by strain Tul-76 from Kazakhstan. L4 with 48 strains is by far the most represented in available WGS data. In addition to strain Tul-20 from Kazakhstan and strain B-7558 presumably from Russia, it comprises strains from Western Europe, particularly Germany (29 strains) and including Scandinavia. Within Europe, Sweden appears to have a special status, as strains from Sweden are present also in L2, and contribute one basal branch within L4. This observation is in agreement with previous reports of a relatively high genetic diversity in Scandinavia, which might suggest that Scandinavia is the primary source for tularemia in Western Europe [47] or that Scandinavia provides more ecological opportunities for long-term maintenance of F. tularensis lineages. Regarding polytomy B.66, Central Asia appears to be a better candidate for being the source since all four branches are represented.
Both lineages L2 and L4 contain secondary polytomies (S2 Fig). The secondary polytomy in L2 contains six branches, with length from root (blue star) to tips varying from five to 14 SNPs. This polytomy is remarquable by its geographic diversity. Two branches are represented by strains isolated in Sweden, and two branches are constituted by strains from West Kazakstan. One of them was isolated from a migratory bird (wheatear Oenanthe isabellina) [49]. The main polytomy in L4 (purple star) contains eleven radiating branches and is shown with a more specific country color code in S3 Fig. Eight branches are represented by strains from Germany and two by strains from Switzerland. The last branch is the most diverse, and was observed in Scandinavia, France, Hungaria and Germany. The most parcimonious interpretation of available data is that Scandinavia (possibly Sweden) was contaminated once by each of the two lineages, L2 and L4, possibly via migratory birds and that secondary contaminations occurred from Scandinavia towards Western Europe, mainly Germany. Much more will need to be known regarding the implication of migratory birds [50] in order to evaluate this preliminary hypothesis.

In vitro genotyping by MLVA
In order to develop a first line genotyping assay, we evaluated the discriminatory power of MLVA in the present collection. Five among the 25 VNTR loci constituting the full MLVA typing scheme are polymorphic, Ft-M3, Ft-M4, Ft-M6, Ft-M20 and Ft-M22 (Tables 2 and S3). The highest discrimination was observed at the FT-M3 locus in agreement with previous reports. Twelve alleles and an HGDI of 0.90 were observed among the 39 strains. Ft-M6 locus was the second most variable, with four alleles. At the Ft-M4, Ft-M20 and Ft-M22 loci, three, two and two alleles were observed respectively.
The five polymorphic loci among the classic 25 VNTR resolve 18 genotypes in the present collection (MLVA5 Nur-Sultan, Table 2). Two additional VNTR loci polymorphic in the present collection were identified by in silico analysis, insilico-FT-4_6bp_97bp_2u and insilico-FT-8_4bp_152bp_3u (S2 Table). According to in silico reads analysis, they are polymorphic in the present collection and allow to resolve one additional genotype (Tables 2 and S3). All seven loci can be amplified in a single PCR reaction with different fluorescent dyes and analysed on a genetic analyzer in one run.

In silico MLVA for F. tularensis subsp. holarctica isolates from Kazakhstan
To satisfy future needs in terms of compatibility between in silico and in vitro MLVA genotyping we evaluated two different approaches to deduce the MLVA genotype from sequence data. In the first and more traditional approach, sequencing reads were assembled before analysis. Two popular assemblers were compared, SPAdes and SKESA. In the second approach, raw sequencing reads corresponding to the VNTR loci were extracted and directly analysed to produce the in silico MLVA genotype. The approach takes advantage of the 300 base-pairs reads produced which should be long enough to enclose in a subset of reads the whole tandem repeat and a few flanking nucleotides from both sides, even for the largest Ft-M3 alleles.
The results obtained by direct reads analysis were fully identical to in vitro data (S3 Table). In contrast, discrepancies were observed in the assemblies. SKESA failed to assemble the Ft-M6 allele in two strains. Ft-M3 was not assembled in eleven and five strains with SPAdes and SKESA respectively (S3 Table). Furthermore Ft-M3 SPAdes assemblies showed an incorrect size in five additional strains. SKESA assemblies predicted an incorrect size for insilico-FT-4 in all strains but one.
In summary, direct analysis of 300 bp reads performed significantly better than assemblies analysis. As expected, the number of recovered reads is proportionally dependent on the size of the VNTR allele as illustrated for the Ft-M3 VNTR in S4 Table. Clustering of F. tularensis subsp. holarctica strains using MLVA and congruence with canSNP Based on seven variable loci, the 39 strains were clustered into 19 genotypes (Fig 3). Ten genotypes are unique, five genotypes are shared by two strains. The four most frequent genotypes are shared by 3, 4, 6 and 6 strains respectively. The eleven strains belonging to the B.66 polytomy are correctly clustered. More generally, the B.4, B.12_B.23 and B.12_B.42 groups are globally clustered as expected from the wgSNP analysis. Some phylogenetic inconsistencies of terminal branches where MLVA is not congruent with wgSNP phylogeny are visible, due to the homoplasia inherent to tandem repeat polymorphism. Most of these MLVA terminal branches are defined only by the highly variable Ft-M3 locus.

Discussion
We present the results of genotyping of thirty-nine F. tularensis subsp. holarctica strains isolated in Kazakhstan including the vaccine strain 15 NIIEG, using MLVA, canSNP and wgSNP. MLVA genotyping was performed by the classical genotyping scheme including 25 loci [4]. In agreement with previous reports five among the subset of ten loci selected by Vogler et al. were variable among the analyzed strains [4,18,51]. For genotyping of isolates circulating in the Central Asian region, we propose a MLVA7 panel combining these five classical VNTR loci and two additional VNTRs identified by in silico analysis of public WGS data. The proposed in silico MLVA genotyping script based on the direct analysis of 300 bp long raw reads made it possible to correctly identify all alleles in the 39 strains. Consequently, MLVA7 constitutes a one PCR assay compatible with in silico WGS analysis which should facilitate the selection of strains for sequencing and the quality control of strain identity. Based on wgSNP analysis, four genetic groups are distinguished in F. tularensis subsp. holarctica: B.4, B.6, B.12 and B.16 [20,44,52]. In our study, six strains isolated during 1951-89 in Eastern and Central Kazakhstan were assigned to the B.4 genetic group. Strains belonging to the B.4 genetic group were previously found in North America, China, and Northern-Eastern Europe [18,20,31,53,54]. The B.4 genotype might be relatively prevalent in Asia and the sequenced isolates from Kazakhstan allow to define a more basal branching point. This and the presence of B.16 subclade in Japan, Turkey, Tibet and Australia increase the likelihood that F. tularensis subsp. holarctica has an Asian origin [30]. The remaining 33 strains from Kazakhstan were assigned to the B.12 genetic group, which is widely distributed in Central to Eastern Europe [29,52].
Tularemia has been and still is the subject of important monitoring in Kazakhstan because of its zoonotic potential. A number of natural foci have been characterised in Kazakhstan and elsewhere, however the global ecology and phylogeography of this important pathogen remains poorly understood [55]. For instance, although the strict clonality of the evolution of the causal agent predicts that it emerged in one location in space and time, the geographic origin of F. tularensis is unknown. One reason for this situation is that in spite of major progress achieved in the past fifteen years owing to the emergence of powerful genotyping tools, and important efforts in terms of WGS production, data is missing for many countries. Another reason is that the phylogeography of F. tularensis is characterised by the finding of almost identical strains in very distant places and/or at very different dates [26,28,54]. A third reason is the poor knowledge of F. tularensis closest neighbors including F. novicida. It is the knowledge of closest neighbors which provided the strongest argument for an African origin of both Mycobacterium tuberculosis and B. anthracis [56,57].
The work presented here is confirming the remarkable features of the phylogeography of F. tularensis, in particular the growing evidence in favor of the capacity of the bacteria for longterm maintenance combined with a capacity for long-distance spread. This would suggest that F. tularensis has a capacity of «perpetuation», i.e. long-term maintenance without genetic changes, in a state implying a low level of replication [58]. A water-borne viable but non-culturable state has been proposed to explain long-term maintenance [59,60]. However, the available evidence is still fragile and limited. Also this behavior does not seem to fit with the relatively homogeneous branch length from root to tips observed in F. tularensis [22]. Data from many geographic areas is lacking and in addition, most investigations so far were restrospective analyses of historical collection. The analyses required the reculturing of the strains to prepare new DNA extractions. One cannot formally exclude cross-contaminations and additional studies will be required to confirm the present observations. In particular, it would be highly recommended to prepare a DNA batch for later sequencing purposes at the time of the very first isolation and growth of the pathogen. This DNA batch could be validated by MLVA7 in vitro genotyping and the resulting data could be archived in an online F. tularensis MLVA database. This initial genotype could serve as a DNA identification control of the eventual WGS data which could be in silico genotyped at least if sequencing reads are longer than 300 bp.
Interestingly, owing to the extensive field sampling in natural foci regularly done in Kazakhstan, a strain was recovered from an Isabellina wheatear incidentally trapped out of a rodent burrow in West Kazakhstan [49]. Although birds have been suspected as potential spreaders of F. tularensis [61], we found very little direct evidence in the litterature involving birds, either migratory [49,50] or scavenger [62]. The Isabellina wheatear is a migratory insectivorous bird. In spring, it lay eggs in burrows of rodents. Between March and August, birds will visit a number of burrows. After that, winter migrations will start, and rodents will come back to burrows to hibernate. Some fleas are shared between wheatear and rodents [63,64]. The Isabellina wheatear breeding geographic distribution ranges from North-East Greece to North East China. It winters in Africa, the Arabian Peninsula and India where it shares wintering areas with the Northern wheatear (Oenanthe oenanthe), which has an even more widespread geographic range, including Scandinavia, Alaska, North-East Canada. A potential role of the Isabellina wheatear in the propagation of plague has previously been proposed [64,65].
Such an implication would help explain some aspects of the phylogeography of the bacteria, including long-distance transportation and long term preservation (within the soil ecosystem of burrows). This would imply that the birds are healthy carriers. Birds would be infected in their breeding area, and would contaminate other migratory birds, such as the Northern wheatear, in their feeding grounds. The implication of migratory birds might also provide a working hypothesis regarding the dating of the emergence of the clonal F. tularensis. It would have followed the end of the last glaciation peak about 20000 years ago. The Isabelline wheatear is considered to have arrived in the lower Volga region approximately 10000 years ago (reviewed in [64]).
Supporting information S1 Fig. Focus on the B.4 lineage. The WGS data from the 17 strains belonging to B.4 were mapped on reference genome OSU18 (assembly accession GCA_000014605) for SNP identification. Two hundred and sixty-six SNPs were called, the Maximum Parsimony tree has a size of 269 (homoplasia 1.12%). For each strain, the country of origin (three letters code), year of isolation, host, strain Id and canSNP assignment are indicated when known. Coloring reflects sublineage assignment. Branches longer than one are labelled with size. The blue star indicates the position of the MRCA. Ninety-six SNPs were identified by mapping on the FSC200 genome sequence (assembly accession GCA_000168775). The Maximum Parsimony tree has a size of 96 (no homoplasy). Branch length from the MRCA indicated by the purple star to the tips varies from two up to eleven SNPs. Circles are labelled with the three-letters country code, year of isolation, strain Id and host. They are colored according to geographic origin. (TIF) S1 Table. List of assemblies and sequence reads evaluated. The "map_SCHU_S4_percent" column indicates for each dataset the percentage of the SCHU_S4 genome (assembly accession GCA_000008985) which does not contribute to the SNP search. This percentage corresponds to regions deleted in the strain being mapped, and also to repeated sequences. The "remarks" column provides a list of duplicate entries (usually strains for which both assemblies and raw data have been deposited, and also reference strains which have been independently sequenced). Entries labelled as "redundant" are entries showing an identical wgSNP profile with an other entry. The last four columns indicate which public datasets were used for the making of the indicated figures. (XLSX) S2