Using affinity propagation clustering for identifying bacterial clades and subclades with whole-genome sequences of Francisella tularensis

By combining a reference-independent SNP analysis and average nucleotide identity (ANI) with affinity propagation clustering (APC), we developed a significantly improved methodology allowing resolving phylogenetic relationships, based on objective criteria. These bioinformatics tools can be used as a general ruler to determine phylogenetic relationships and clustering of bacteria, exemplary done with Francisella (F.) tularensis. Molecular epidemiology of F. tularensis is currently assessed mostly based on laboratory methods and molecular analysis. The high evolutionary stability and the clonal nature makes Francisella ideal for subtyping with single nucleotide polymorphisms (SNPs). Sequencing and real-time PCR can be used to validate the SNP analysis. We investigate whole-genome sequences of 155 F. tularensis subsp. holarctica isolates. Phylogenetic testing was based on SNPs and average nucleotide identity (ANI) as reference independent, alignment-free methods taking small-scale and large-scale differences within the genomes into account. Especially the whole genome SNP analysis with kSNP3.0 allowed deciphering quite subtle signals of systematic differences in molecular variation. Affinity propagation clustering (APC) resulted in three clusters showing the known clades B.4, B.6, and B.12. These data correlated with the results of real‐time PCR assays targeting canSNPs loci. Additionally, we detected two subtle sub-clusters. SplitsTree was used with standard-setting using the aligned SNPs from Parsnps. Together APC, HierBAPS, and SplitsTree enabled us to generate hypotheses about epidemiologic relationships between bacterial clusters and describing the distribution of isolates. Our data indicate that the choice of the typing technique can increase our understanding of the pathogenesis and transmission of diseases with the eventual for prevention. This is opening perspectives to be applied to other bacterial species. The data provide evidence that Germany might be the collision zone where the clade B.12, also known as the East European clade, overlaps with the clade B.6, also known as the Iberian clade. Described methods allow generating a new, more detailed perspective for F. tularensis subsp. holarctica phylogeny. These results may encourage to determine phylogenetic relationships and clustering of other bacteria the same way.

Introduction Francisella (F.) tularensis is the causative agent of tularemia and a highly infectious, Gram-negative, bacterial pathogen [1][2][3]. F. tularensis is listed as a category A bioterrorism agent, not only due to its very low infectious dose but also due to its pathogenicity and its virulence, the ability to multiply within host cells and its ability of intracellular survival [4,5]. Tularemia is a febrile disease that may be severe to fatal. Prompt antibiotic treatment avoids severe complications [6]. The major causes of tularemia in humans are the two subspecies F. tularensis subsp. tularensis and F. tularensis subsp. holarctica [7]. Only the less pathogenic F. tularensis subsp. holarctica is endemic in Europe [2,8]. In Germany and France, most human infections are caused by contact with infected European brown hares (Lepus europaeus) [1,[9][10][11].
Genome sequencing and analysis have been performed on several F. tularensis strains. Francisella species share many biological and genomic attributes [12][13][14]. All Francisella isolates have small conserved genomes of about 1.8 Mb. They have a high degree of genetic similarity and are thus monomorphic. They have average nucleotide identity (ANI) of � 97.7% [15,16] and even � 99% within the subspecies F. tularensis subsp. holarctica [17]. However, virulence and pathogenicity are significantly influenced by slight genetic and functional differences [18][19][20][21][22][23][24]. Because of the limited genetic variation in F. tularensis, single nucleotide polymorphisms (SNPs) are the preferred genetic markers for molecular typing. So far, the standard of classifying sequencing data of Francisella [25] is a reference-dependent (full or draft genomes are used as reference) and alignment dependent (i.e. the method for calculation of a distance matrix of sequences) analysis and is termed canonical SNP typing (canSNPs). The canSNPs or other reference-dependent and alignment-dependent methods are used to unravel the relationship among of closely related populations. Additionally, a classification with canSNPs is dependent on the selected reference genomes and based on alignment and is not as mathematical independent as the methods presented. On the other hand, reference independent and alignment-independent methods exist that are not limited to a reference genome.
They enable alignment of sequences without dependency on a reference genome. These methods (e.g. kSNP (reference and alignment independent), Parsnp (both reference-dependent and-independent), and average nucleotide identity (ANI) (reference and alignment independent)) and allow for resolving phylogenetic relationships, based on objective criteria.
The phylogenetic clustering reveals shared evolutionary trajectories. The most commonly used method is based on Bayesian analysis of population structure (HierBAPS), but it relies on pre-specified population structures. HierBAPS allows clustering of the phylogenetically related groups, but previous assumptions are needed [26]. HierBAPS is used as a hierarchical approach using Bayesian model-based DNA sequence clustering, where data from a cluster at a particular stage of the hierarchy are clustered in the next stage, providing a useful way of increasing statistical power to detect separate lineages residing within the data [26]. HierBAPS relies on metadata. It provides a way of recognizing similarities and differences in the distribution of similar genotypes.
To be independent of presumptions, we have devised a strategy enabling us to resolve phylogenetic cluster of bacteria also independently. The reference independent and alignmentindependent method (affinity propagation clustering-APC) has successfully applied for clustering of bacterial genes and viral (Fischer, 2018) nucleotide sequences but never on whole genomes.
The aim of the study was to evaluate an analysis pipeline that utilizes a reference independent and database independent clustering method (APC), and that therefore provides a maximum level of objectivity in characterizing the phylogenetic relationships among clusters. A comprehensive panel of German Francisella isolates of equal quality served as an ideal data set with optimal data quality and data quantity (geographic data, PCR/qPCR and MALDI--TOF-MS)). In order to draw a comprehensive picture three different types of SNPs sets: core genome SNP, whole-genome SNP, and ANI (average nucleotide identity) were used. Together, these methods consider small, medium, and large differences. The phylogenetic clustering reveals shared evolutionary trajectories. The results were compared with the so far gold standard HierBAPS.

Materials and methods
We assessed the phylogeny of F. tularensis subsp. holarctica with 152 samples from Germany and Austria collected in the years 2006-2018. The microbial phylogeny of this bacterium was assessed using assemblies from sequenced isolates. Reference genome independent methods covered the core genome, all genomic SNPs, and the ANI. First, for core genome analysis the Harvest suite was used reference independently [27]. Second kSNP3.0 was used, a k-mer based tool that can identify SNPs in hundreds of microbial genomes without the requirement for genome alignment or a reference genome [28,29]. Third ANI analysis was done to verify the results independent of references and gene prediction. All the results were compared with laboratory methods (real-time PCR assays targeting canSNP loci) and established NGS-based methods with reference-dependent canSNPer. Then the reference-independent molecular typing tools were combined with APC, HierBAPS and SplitsTree.

Bacterial strains
For phylogenetic analysis, bacterial strains were chosen from the collection of isolates and sequences maintained at the National Reference Laboratory of Tularemia at the Friedrich-Loeffler-Institut, Jena, Germany. F. tularensis subsp. holarctica strains were handled following German biosafety regulations. For this study, F. tularensis subsp. holarctica isolates collected in the years 2006-2018 were used and are in part publically available [1,9,11,[30][31][32][33][34]. All isolates were manually checked for completeness of metadata, e.g. year of isolation, geographical origin, and host (S1 Table, only the district is disclosed). All strains were characterized using a combination of independent methods including MALDI-TOF MS, conventional PCR and selected strains with real-time PCR assays as previously described [30,35] (see S1 Table). Due to financial and temporal limitations, only a subset of the strains was assigned to clades using a set of real-time PCR results [10,32]. Isolates originate from hares (Lepus europaeus), edible dormice (Glis glis), red foxes (Vulpes vulpes), and ticks (Ixodes ricinus). The F. tularensis subsp. holarctica strains used in the present study were cultivated on cysteine heart agar (CHA, Becton Dickinson, BD Heidelberg, Germany) from animal carcasses or ticks harvested from these carcasses. The cultivation of bacteria from organ specimens was performed on cysteine heart agar at 37˚C with 5% CO 2 for 48 h. Prior to further handling isolates were inactivated at 95˚C for 20 min. Sequences of reference strains NC_009749.1 (clade B.6), NC_01746 (clade B.4) and NC_019551 (clade B.12) were obtained from the NCBI database.

DNA extraction and genome sequencing
DNA for whole-genome sequencing was prepared from a 10 ml culture in brain heart infusion broth (Sifin, Berlin, Germany). Bacterial cells were harvested after 72 h by centrifugation and the DNA was extracted and purified using QIAGEN Genomic-tip 20/G and a QIAGEN Genomic DNA buffer set kit (Qiagen, Hilden, Germany) according to the recommendation of the manufacturer. The DNA quality was examined by using a Qubit 2.0 fluorometer (Life Technologies, Darmstadt, Germany) and by agarose gel electrophoresis.

Sequencing, assembly, annotation and genomic analysis tools
The isolates were subjected to Illumina HiSeq and/or MiSeq sequencing using the adjusted Nextera XT or HT DNA protocol for library preparation (GATC, Konstanz, Germany; BfR, Berlin, Germany or in-house). The number of reads after filtering ranged from 0.5 million to 5 million. At least 100,000 paired-end reads were generated and filtered with a Phred score averaging >38. The libraries were tested for contamination by analysis with Kraken [36] and MetaPhlAn [37]. Contaminated datasets were removed. Further processing included quality trimming and assembly (included in SPAdes 3.12.1. in Bayes Hammer mode [--careful], [38]). Analysis of data was performed with QUAST v4.3 and Bandage 0.8.1 using standard settings [39,40]. Filtering was performed by removing contigs with k-mer coverage less than 5x and size below 500 bases. Only assemblies without possible contamination or incomplete sequencing were allowed by excluding assemblies >4 Mb and <1 Mb of predicted total length. TempEst was used for the visualization and analysis of temporally sampled sequence data [41] as specified before [11].

Accession numbers
New Assemblies have been stored in a BioProject that has been deposited at NCBI under the accession PRJNA560345. Data from Bioprojects PRJNA464279, PRJNA422969 and PRJNA353900 had been used.

Phylogenetic analyses
We chose three different reference-independent methods to estimate the evolutionary distances between the investigated strains: Parsnp within the Harvest suite [27] to detect the core genome SNPs based on genome alignment, kSNP3.0 [28] to report the whole-genome SNPs in an alignment independent approach relying on kmer analysis, and the python script pyani [42] to report percentages of average nucleotide identity between the strains (S1 Fig). Parsnp and kSNP3 were used in standard settings. However, the program pyani was used with (-m ANIm) employing MUMmer (NUCmer) for alignment. From the core genome alignment produced by Parsnp, we constructed a maximum likelihood phylogeny with RaxML [43] using the GTRGAMMA model rate of heterogeneity and supported by a bootstrapping test with 100 resamples. All results of Parsnp, kSNP3.0 and pyani were compared with the laboratory typing results.
Using iTol [44] and the results from tree analysis of kSNP3.0, the distance matrices could be sorted according to the phylogenetic analysis with kSNP3.0 (S2 Fig). With core genome SNPs multiple meta-alignment files were used for SplitsTree [45]. These data were used to investigate the temporal signal of German Francisella in an outbreak scenario and 'clocklikeness' of molecular phylogeny using the tool TempEst results in R2 = 6,7 � E-2 value, less than 0.5, suggesting weak clock-like behavior. The regression slope (rate) included negative values.

Clustering
To group bacteria in a phylogenetically related cluster, an application of a data mining technique called affinity propagation clustering (APC) is used. This clustering technique needs no presumptions or supervision, unlike other clustering methods used for bacteria [46]. With these methods, we would like to generate new insight and a new method to cluster bacteria. APC is fast and mathematically independent. This enabled a novel detailed view on F. tularensis subsp. holarctica epidemiology in Germany. The APC results were compared to the results of the HierBAPS clustering [26], which needs predefined parameters. SplitsTree reveals a different angle being alignment-based but unrooted and assumption-free.
Two clustering methods were compared: APC [47] and HierBAPS [26]. APC relies on distance matrices. All three phylogenetic methods yielded distance matrices to be used in APC. Pairwise distances of 152 isolates and 3 references as calculated in the phylogenetic analysis of Parsnp, kSNP3.0, and pyani were merged into a distance matrix each and imported to the statistics software R [48]. For further analysis, the package "apcluster" was used essentially as described. By default, the APC algorithm determines one sequence among the set of input elements for each potential cluster, which is most representative of this cluster. In APC terminology, these sequences are termed "cluster exemplars". Since this method was initially developed for the analysis of similarity matrices, the distance matrix from the sequence alignment had to be converted by inverting the values. Also, all values embedded in the matrix were squared to improve the robustness and discriminatory power of the analysis. Subsequently, the APC algorithm computes the minimum (pmin) and maximum (pmax) of the input preference (p), which is defined as the tendency of each sequence to become a "cluster exemplar" [47]. To define the optimal input preference (p), the number of cluster for the complete preference range (pmin-pmax) was calculated stepwise. The optimal input parameter for intraspecific analyses, i.e. the optimal number of cluster, was defined as the largest range of input parameters for which a constant number of cluster is calculated. This range is termed "plateau" throughout the manuscript. Methodologically, the beginning of the lower bound of the "two cluster plateau" cannot be defined and therefore the length of this plateau was not considered further. According to the now defined optimum input parameter, APC calculated the respective number of cluster and allocated any input sequence to only one of these. Results of APC were summarized on the country level and exported as CSV file into the GIS software QGIS (Version 2.18 "Las Palmas"; QGIS Development Team (2019)). QGIS Geographic Information System. Open Source Geospatial Foundation Project. http://qgis.osgeo.org). Each sequence dataset was completed with the assigned clade using PCR/qPCR results.
As comparison HierBAPS in the standard-setting was used [26]. As an unrooted phylogenetic network SplitsTree [45] with standard-setting using the aligned SNPs from Parsnp. The statistics program silhouette analysis was used to evaluate the separation distance between the resulting cluster.

Results
In total, 152 F. tularensis subsp. holarctica isolates were sequenced and assembled to assess the phylogenetic relationships between them. Three reference sequences, showing the three major clades of F. tularensis subsp. holarctica were added from the NCBI database (NC_009749, NC_017463, NC_019551). After quality and contamination filtering (excluding data with a PHRED score >38 and contamination of more than 20% reads from other species). The quality assessment showed that individual sequence lengths of the assemblies were 1.8 million nucleotides derived from 79 to 145 contigs. Altogether the sequences had an average G+C content of 32.2%.

Phylogenetic analysis
First, a core genome analysis was conducted with the Parsnp [26] (Fig 1). The mandatory � 97% ANI was verified (S1 Fig). Using the phylogenetic analysis three cluster could be visually distinguished. These cluster are in concordance with clades analyzed in the laboratory by PCR as B.4, B.6 and B.12 (Clade, PCR verified, S1 Table). Additionally, the subclade B.7 and B.71 within the clades B.6 and B.12 (with the subclade definition used by the canSNPer and the real-time PCR) could be visually distinguished and verified.
Second, within the maximum likelihood tree of kSNP3.0, the three cluster described above were present. In addition, two subclades B.7 and B.71 could be distinguished by trend (Fig 2). The kSNP3.0 analysis was followed by RaxML. For the parsimony analysis, 3,985 SNPs were used. In these analyses, 97.1% (299 out of 308) of all nodes showed bootstrap values higher than 70%. One visible subcluster within the cluster B.6 correlated to the canSNPer classification of B.7. The other potential subcluster within the subgroup B.12 correlates to the canSNPer classification of B.71. Not in all cases, canSNPer analysis correlated with the resulting three phylogenetic analyses described. In particular, the subcluster groups containing isolates B.45 and B.61 appear to be much more divergent. The subclades within the clade B.12 do not correlate continuously. KSNP3.0 reports a consensus of the equally most parsimonious trees. In total, 3,985 SNPs were reported for all genomes, compared to the 116 SNPs reported in the canSNPer (version Wittwer 2018).
As a third method, the determination of the ANI was used. The ANI of all isolates was extremely high identical within the isolates with more than 99.9% identity. The three clades mentioned above could be distinguished and to a lesser extent B.7 and B.71.
SplitsTree, an alignment-dependent but assumption-free method was used to generate unrooted phylogenetic networks from molecular sequence data. Given an alignment of coregenomic SNPs extracted from the Harvest suite, the program computed a network (Fig 3) and confirmed the three cluster (B.4, B.6, B.12) and to a lesser extent B.7 and B.71.

Clustering analysis
A dendrogram does not allow to answer the question if we should distinguish between three (B.4, B.12, And B.) or five clades (additionally B.71 and B.7). Dendrograms remain subjective. The clustering analysis provided an objective way to calculate of how many cluster are in a sample set. As a central result, we could define a three cluster demarcation within the F. tularensis subsp. holarctica isolates in Germany. APC assigned three genetic cluster in all distance matrices generated by the three methods (Parsnp, kSNP3.0, pyani) (Fig 4) and was comparable to real-time PCR assays in the laboratory. APC has a high computational efficiency [49], which substantially reduces the turnaround time. The main advantage is that it overcomes the described subjective criteria for cluster allocations with the help of mathematical algorithms. The results are non-hierarchically ordered cluster [49,50]. By application of APC to pairwise genetic distances from an alignment of all 152 + 3 sequences, the most stable distribution after iteration over all possible input parameters was determined as three cluster (Fig 4). Statistic support was calculated with the r package silhouette. The analysis calculated the separation distance between the resulting cluster.   Table). These are in accordance with the cluster B.4, B.6, and B.12 as tested before (S1 Table and S3 Table).
As another clustering method, HierBAPS is compared to APC. HierBAPS is not mathematical independent as affinity propagation. HierBAPS is used to detect the underlying population substructure dependent on SNP alignment, which is done by Parsnp. This approach involves the application of iterative clustering [47,51]. The software HierBAPS identified four major cluster at the first level of clustering and nine cluster at the second clustering level (S2 Table). The analysis of the distinguishing core genome SNPs from Parsnp alignment showed in each cluster a dramatic reduction of the SNPs (from 1,517 to 371, 358 and 65) by supporting the formerly obtained results. The subclades B.7 and B.71 showed a less dramatic reduction compared to the clades with 37 and 9 SNPs to their clades, respectively (S2 Table). The grouping of   i.e. the optimal number of cluster, was defined as the largest plateau (three AP cluster, see arrows). The y-axis represents the number of cluster while the x-axis represents the input parameter. A: based on core genome SNPs, B: based on whole-genome SNPs and C: based on ANI. Optimal input preference for intraspecific analyses, i.e. the optimal number of cluster, was defined as the largest plateau (three AP cluster) as methodologically, the beginning of the lower bound of the two cluster plateau cannot be defined. (Preference range ANI: between -3e-05 and 0; preference range kSNP: between -6 and 0; preference range Parsnp: between -4.5 and 0). https://doi.org/10.1371/journal.pntd.0008018.g004 subcluster B.71 in affinity propagation, whereas they belonged to clade B.12 based on laboratory methods and APC (S1 Table).
The geographical representation of all isolates is displayed with the APC designation ( Fig 5). The data provide evidence that Germany is indeed the collision zone were the clade B.12, also known as East European clade overlaps with the clade B.6, also known as Iberian clade [1,2].

Discussion
Whole-genome sequencing can be used for bacterial strain typing and epidemiologic analysis [52,53]. A whole array of bioinformatics tools are available for processing NGS data are increasingly open-source and of high quality [53][54][55]. There are two main analytical bioinformatics approaches to the exploitation of whole-genome sequencing data: reads can be aligned or subjected to de novo assembly [56]. The choice of strategy depends on the read length obtained, the availability of good reference sequences and the intended biological application. Alignment-based sequence comparison has several setbacks: alignment-producing programs assume collinearity, which is very often violated in the real world. The accuracy of sequence alignments drops off rapidly in cases where the sequence identity falls below a certain critical point, the generation of alignments is memory and time-consuming, an NP-hard problem and depends on multiple a priori assumptions about the evolution of the sequences that are being compared. For these reasons, we tested alignment-free assembly-based methods. Since 2012 de novo assembly strategies have progressed greatly [57]. The advantage of using assemblies is that there is no bias towards a reference genome. In addition, no template has to be adapted, and though it needs more reads and it is generally more fragmented, it works better for largescale and medium-scale differences. Mapping can be better for small-scale differences such as SNPs and small variations, however, new sequences that are completely different can be lost.
We tested and compared methods that are either dependent or independent of a reference of an assembly or an alignment. With a growing number of genome sequences the multiple alignments of homologous sequences followed by inference of a tree scale poorly. Therefore, independent 'alignment-free' methods should be preferably used (71). We aimed for a universally applicable workflow to analyze phylogenetic cluster of bacteria using Francisella tularensis subsp. holarctica as a model organism.
When using genomic variants for phylogenetic analysis, comparative genomics, or outbreak investigations, it is critical to properly evaluate the variant calling method and to re-evaluate them regularly. Core genome SNPs (Parsnp), whole-genome SNPs (kSNP3.0) and ANI (pyani) were compared to each other.
Parsnp combines the advantages of both, whole-genome alignment, read mapping, and can be scaled to thousands of closely related genomes. To achieve this scalability, Parsnp is based on a suffix graph data structure for the rapid identification of maximal unique matches (MUMs), which serve as a common foundation for many pairwise and multiple genome alignment tools. The core genome of the genome sequences is used to create multiple meta-alignment files (used in Parsnp analysis (26) and SplitsTree (46)). In the unrooted network analysis of SplitsTree, the three clades of Francisella isolates were independent of the rooted phylogeny distinguishable, reinforcing the made assumption.
kSNP3.0 is a program for SNP identification and phylogenetic analysis without genome alignment or the requirement for reference genomes. kSNP3.0 is based on k-mer recognition. It provides the highest resolution compared to canSNPer, Parsnp, and pyani. The kSNP3.0 analysis allowed deciphering even quite subtle signals of systematic differences in molecular variation (Fig 2) comparable to the Parsnp analysis. A direct comparison to the canSNPer analysis that allows discreet, though more subjective subclustering could be generated (S1 Table, S3 Fig). The bootstrap values of more than 70% are recognized as a threshold for reliability [58], demonstrating that bootstrap support of nodes does not result in the delineation of meaningful cluster, if large numbers of full genome sequences are analyzed [59]. The kSNP3.0 allowing an analysis of at least 34-fold higher resolution than canSNPer. The higher resolution assigned only 3 cluster compared to more than 80 in the canSNPer. Three cluster could be validated with three independent methods and physiological relationships (for example erythromycin resistance [60]). It would be interesting to see if with an independent method the canSNPer designs really show relevant cluster and physiological relationships. It might be overdesigning the cluster. Nevertheless, all used data correlated mainly with the phenotype, canSNPer, and qPCR results (S3 Fig).
The ANI is a measure of nucleotide-level genomic similarity especially suitable for major differences. Although Francisella is highly monomorphic, the clustering approach could be verified in a reference-independent and alignment-free independent method.
The aim of this study was to resolve if it is possible to determine phylogenetic relationships and clustering of bacteria to optimize outbreak analysis for Francisella tularensis isolates in Germany. Classification, clustering, and designation is a key step in phylogenetic and outbreak analysis required for understanding and adequate management of organisms. The familiar evolutionary model assumes a tree, a model that has greatly facilitated the discussion and testing of hypotheses. However, it is well known that such models (46) poorly describe more complex evolutionary scenarios. Therefore, we used SplitsTree in a combination with the core genome alignment of Parsnp to visualize cluster within an unrooted model. We added thus an independent approach to the phylogenetic analysis and verified the three detected cluster. In addition, the subclades B.7 and B.71 were distinguishable (S4 Fig). After phylogenetic analysis, a clustering is needed to allow cluster definition root-independent and objectively. Two clustering methods were compared: affinity propagation and Hier-BAPS. With APC it is not relevant to know the evolutionary dynamics of F. tularensis subsp. holarctica and it needs no further preconceptions as in HierBAPS. Affinity propagation provided meaningful clustering. It can be performed on whole-genome sequences [50,61].
We found that APC is particularly suitable for applications in clustering bacteria [47]. APC is a tool that was developed for clustering similarity measures between all pairs of input samples based on the concept of "message passing" between data points. Besides its computational efficiency, which substantially reduces the turnaround time, the main advantage is the overcoming of the described subjective criteria for cluster allocations with the help of mathematical algorithms. The results are non-hierarchically ordered cluster, which can have unequal cluster sizes [49,50]. Thus subjective cluster allocation can be made. By application of APC to pairwise genetic distances from an alignment of F. tularensis subsp. holarctica genomes, the most stable distribution after iteration over all possible input parameters were determined and three cluster equal to known clades were defined. This approach has already been successfully applied for various tasks in bioinformatics, e.g. for microarray and gene expression data [62][63][64][65] but not to our knowledge on whole-genome analysis. An extended panel of newly obtained full genomes sequences of F. tularensis subsp. holarctica was used to demonstrate the application of APC for bacteria and thus clade definition as well as for comparison with previous studies. Similar problems have been described for viruses and bacteria of other genera and alternative solutions have been developed [58,66,67]. APC seems particularly promising and could help to solve species delineations in asexual lineages where obligate gene exchange cannot be used as a delineation criterion [61]. The high speed of the APC algorithm gives the possibility to analyze very large data sets. We, therefore, believe that this algorithm is very useful for classifying cluster in other bacteria. The results were confirmed with the statistical analysis program silhouette (r-package cluster 2.1.0, [68]) to verify the cluster number and showing that the clustering had strong support.
We confirmed the reliability of the subclustering also with independent mathematical methods such as SplitsTree and HierBAPS. The clustering of the affinity propagation was verified with an independent method, HierBAPS [26] (S3 Table).
As comparison HierBAPS in the standard-setting was used as a hierarchical approach using Bayesian model-based DNA sequence clustering. Data from a cluster at a particular stage of the hierarchy are reclustered in the next stage, providing a useful way of increasing statistical power to detect separate lineages residing within the data [26]. A parallel approach was an unrooted phylogenetic network. The results of BAPS also n on the phylogenetic results when taking into account the first and second levels of BAPS clustering. The discovery of nested genetic population structures has been shown before [51]. Four clades were identified into two subcluster. The software HierBAPS split cluster B.4 and the subcluster equivalent to subcluster B.71, traditionally belonging to B.12, in the second level of clustering. One can assume that one of these clades showed more divergence between the strain, which could be divided into two subcluster in the second partitioning of the clades. This is a known issue for the BAPS algorithm, where a conservative clustering is applied. However; the population sample could not be uniformly divergent [26]. When we consider the second level of clustering, five clades and four outliers could be delineated, supporting the results of APC of three clades B.4, B.6, and B.12, and the subclades B.7 and B.71. The clustering algorithm is also known to benefit from large sample numbers and improve with more samples, so when more data are provided, the results might concur with the laboratory data. The cluster identified by APC was compared to HierBAPS. Both methods describe qualitatively distinct clades. HierBAPS' clusters are not preferable to APCluster's; because the algorithm needs preconceptions. The cluster are not in concordance with the canSNPer results and the computing time is longer. APC has the advantage that a mathematical independent objectively approach, is in consensus with laboratory data and has fast computing times.
For outbreak analysis, it is imperative to analyze the host, temporal and phylogeographic association within the underlying phylogenetic analysis. There is a strong indication of the correct assignment and support for the methods used (core genome SNP, whole-genome SNP, and ANI). For the host association not distinguishable, unfortunately, no confirmation of the route of infection tick-hare or hare-tick could be confirmed. The hosts of the subcluster B.7 were two edible dormice, a hare, and a fox showing an unusual host preference, bearing in mind that hares that were host to over 90% of the isolates in this data set. The low diversity of the dataset could lead to a biased interpretation and might be remedied in time. For the temporal and phylogeographic association, only nidal and random temporal and spatial distribution patterns can be described. To assess the temporal evolution the rates of molecular evolution are calculated by the product of the number of mutations that arise per replication event, the frequency of replication events per unit time and the probability of mutational fixation [69]. The regression slope (rate) found included negative values, suggesting that these rates are either too low or not enough with the used data set to allow reliable rate estimation. Similar data are obtained from Mycobacterium leprae. This suggests that the high specialization and success of F. tularensis subsp. holarctica leads to an evolutionary dead-end such as in Clostridium chauvoei [70]. An analysis with BEAST, as done before with only 67 samples [2] is not recommended, because the molecular clock models employed are statistically conditioned on having an evolutionary rate greater than zero [41]. That might be due to fewer samples or to a different assembler strategy, that had been shown to an over-and underestimated gene content and genome sizes up to 36% [11].
In time the sample number of isolates will be more comprehensive. When used with more isolates from different countries it might be even possible to establish a higher spatial and temporal resolution and thus to generate a highly standardized nomenclature for subpopulations. The geographic resolution is allowed within limits of a 400 km radius. However, for clonal organisms as Francisella, it can be difficult to decide where to set demarcations between groups [15,54,56,60,61,[71][72][73][74]. The geographic distance of indistinguishable isolates can be 275-400 km demonstrates the wide distribution of F. tularensis subsp. holarctica within Germany. The data provide evidence that Germany is indeed the collision zone were the clade B.12, also known as the East European clade overlaps with the clade B.6, also known as Iberian clade. In previous phylogenetic studies, cluster allocation of Francisella isolates was either based on qPCR, PCR, canSNPer, and region of origin. In the epidemiological collision zone, where the Clade B.6 and B.12 meet a boundary is perceivable. At this boundary changes of clades can occur in the spatial distribution (for example migration) and genetic changes (for example transfer of genetic elements). This should be carefully monitored. However, the allocation of F. tularensis subsp. holarctica into clades might biased and subjective because the thresholds of statistical support vary depending on the respective reference genomes and databases. Though one of the major subcluster B.7 and B.71 correlate well with traditional methods, further split subclades reflect the results from canSNPer and qPCR could not be consistently verified. Combining cluster identification, we could provide a simplified evolutionary scheme for F. tularensis subsp. holarctica. By applying affinity propagation on whole genomes of other bacteria, other clades and subclades in these pathogens will be accessible cost-efficient and free software. This should allow the rapid risk assessment in the setting of epidemics and outbreaks.