Phylogenomics of Leptospira santarosai, a prevalent pathogenic species in the Americas

Background
Leptospirosis is a complex zoonotic disease mostly caused by a group of eight pathogenic species (L. interrogans, L. borgpetersenii, L. kirschneri, L. mayottensis, L. noguchii, L. santarosai, L. weilii, L. alexanderi), with a wide spectrum of animal reservoirs and patient outcomes. Leptospira interrogans is considered the leading causative agent of leptospirosis worldwide and is the most studied species. However, the genomic features and phylogeography of other pathogenic Leptospira species remain to be determined.
Methodology/principal findings
Here we investigated the genome diversity of the main pathogenic Leptospira species based on a collection of 914 genomes from strains isolated around the world. Genome analyses revealed species-specific genome size and GC content, and an open pangenome in the pathogenic species, except for L. mayottensis. Taking advantage of a new set of genomes of L. santarosai strains isolated from patients in Costa Rica, we took a closer look at this species. L. santarosai strains are largely distributed in America, including the Caribbean islands, with over 96% of the available genomes originating from this continent. Phylogenetic analysis showed high genetic diversity within L. santarosai, and the clonal groups identified by cgMLST were strongly associated with geographical areas. Serotype identification based on serogrouping and/or analysis of the O-antigen biosynthesis gene loci further confirmed the great diversity of strains within the species.
Conclusions/significance
We report a comprehensive genome analysis of pathogenic Leptospira species with a focus on L. santarosai. Our study sheds new light on the genomic diversity, evolutionary history, and epidemiology of leptospirosis in America and globally. Our findings also expand our knowledge of the genes driving O-antigen diversity. In addition, our work provides a framework for understanding the virulence and spread of L. santarosai and for improving its surveillance in both humans and animals.


Introduction
Leptospira is a highly heterogeneous bacterial genus divided into pathogenic and saprophytic species and further subdivided into more than 300 serovars, which are defined according to the structural heterogeneity of the lipopolysaccharide (LPS) O-antigen. Nowadays, strain identification is mainly based on genome analysis, and core genome multilocus sequence typing (cgMLST) [1] enables identification at the species level and below. Recent studies have also shown that whole-genome sequences can be used to predict Leptospira serotypes on the basis of the rfb locus, which contains the genes for O-antigen biosynthesis [2,3]. This approach offers a promising alternative to conventional serotyping, which is laborious, time-consuming, expensive and requires a high level of expertise.
Over the past decade, the number of described Leptospira species has rapidly increased from 22 in 2014 to 69 in 2022 [4], largely due to improved protocols for culture isolation from the environment [5,6] and the widespread adoption of next-generation sequencing [7]. Within the genus Leptospira, eight species (L. interrogans, L. kirschneri, L. noguchii, L. santarosai, L. mayottensis, L. borgpetersenii, L. alexanderi and L. weilii), which diverged after a specific node of evolution, constitute the most virulent group of pathogenic species [8]. These Leptospira species are the causative agents of leptospirosis in both humans and animals, leading to a high disease burden in tropical countries [9] and major economic losses in the livestock sector [10].
Our previous analysis of the distribution of pathogenic Leptospira species showed that L. interrogans is the most frequently encountered and globally distributed species [1]. This cosmopolitan species is also by far the most studied in terms of virulence and molecular epidemiology, among other aspects. In contrast, very little is known to date about the geographical distribution, reservoirs, genomic features and virulence factors of pathogenic species other than L. interrogans. In the same analysis, we showed that some pathogenic species were geographically restricted [1]. Thus, only limited reports have described the existence of L. santarosai outside the American continent. L. santarosai, named after Carlos A. Santa Rosa, a Brazilian veterinary microbiologist who pioneered the study of leptospirosis in Brazil, was first described in 1987 [11]. L. santarosai is predominant in many countries of Central and South America.

PLOS NEGLECTED TROPICAL DISEASES
In the present study, we first performed an analysis of the pangenome in pathogenic Leptospira species and then took a closer look at the genetic diversity of L. santarosai, including a set of strains recently isolated from patients in Costa Rica, a country where leptospirosis is endemic [12].
Phylogenomic analysis of L. santarosai genomes will enable a better understanding of the genetic diversity and genome features of this pathogenic species, which is prevalent in most countries of the American continent.

Ethics statement
According to the decree number 40556-s of the General Health Law of Costa Rica, epidemiological studies that incorporate the review of clinical records do not require the approval of an ethics-scientific committee. Additionally, no written informed consent from patients was required, as the study was conducted as part of the routine diagnosis at the Centro Nacional de Referencia de Bacteriología of the Instituto Costarricense de Investigación y Enseñanza en Nutrición y Salud (INCIENSA). No additional clinical specimens were collected for the purpose of the study. Human samples were anonymized, and collection of the samples was conducted according to the Declaration of Helsinki.

Strains
Isolates sequenced in this study (n = 153) were obtained from the collections of the French National Reference Center for Leptospirosis (Institut Pasteur, Paris, France), Laboratorio de Genética Molecular (Instituto Venezolano de Investigaciones Científica, Caracas, Venezuela), Institut Pasteur of Alger (Algiers, Algeria), Institute of Veterinary Bacteriology (University of Bern, Switzerland), Molecular Epidemiology and Public Health Laboratory (School of Veterinary Sciences, Massey University, New Zealand), Instituto de Higiene (Facultad de Medicina, Universidad de la República, Montevideo, Uruguay), Universidade Federal Fluminense (Rio de Janeiro, Brazil), Faculty of Veterinary Medicine (University of Zagreb, Croatia), National Collaborating Centre for Reference and Research on Leptospirosis (Academic Medical Center, Amsterdam, the Netherlands), Laboratory of Zoonoses (Pasteur Institute in Saint Petersburg, Saint Petersburg, Russia), Institute for Medical Research (Malaysia), Faculty of Medicine and Health Sciences (University Putra Malaysia, Malaysia), Leptospirosis Research and Expertise Unit (Institut Pasteur Nouvelle-Calédonie, Nouméa, New Caledonia), and Kimron Veterinary Institute (Israel). We also downloaded genomes from our previous studies, including isolates from the collections of the Lao-Oxford-Mahosot Hospital-Wellcome Trust-Research Unit (LOM-WRU) (Microbiology Laboratory, Mahosot Hospital, Vientiane, Lao People's Democratic Republic), Unidad Mixta Pasteur-Instituto Nacional de Investigación Agropecuaria (Institut Pasteur of Montevideo, Montevideo, Uruguay), Centre Hospitalier de Mayotte (France), and Department of Mycology-Bacteriology (Institute of Tropical Medicine Pedro Kourí, Havana, Cuba) [1,2,13-15], as well as genomes from the NCBI database. Information on the strains and genomes used in this study is provided in S1 and S2 Tables.

Whole-genome sequencing
Illumina sequencing was performed on genomic DNA extracted from exponential-phase cultures using a MagNA Pure 96 Instrument (Roche, Meylan, France). Next-generation sequencing (NGS) was performed using the Nextera XT DNA Library Preparation kit and the NextSeq 500 sequencing system (Illumina, San Diego, CA, USA) at the Mutualized Platform for Microbiology (P2M) at Institut Pasteur. CLC Genomics Workbench 9 software (Qiagen, Hilden, Germany) was used for analyses. The generated contig sequences, together with the sample metadata, are available in BIGSdb hosted at the Institut Pasteur (//bigsdb.pasteur.fr/leptospira/). We also downloaded additional genome sequences of Leptospira isolates from the NCBI database (S1 Table). Only genomes meeting quality requirements, such as i) sequencing coverage >30x, ii) number of contigs <600, iii) cumulative contig length within the typical range of Leptospira genomes (3.6-6 Mb), iv) GC content within the typical range of Leptospira genomes (35-48%), and v) <100 uncalled cgMLST alleles out of the 545 pre-defined core genes, were selected for further analyses.
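The quality requirements listed above can be read as a simple filter over per-assembly statistics. The following is a minimal sketch only, not the pipeline used in the study: the `GenomeStats` record, its field names and the two example assemblies are hypothetical, and treating the size and GC ranges as inclusive is an assumption. Only the thresholds themselves come from the text.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Hypothetical per-assembly summary; field names are illustrative."""
    coverage: float          # mean sequencing depth (x)
    n_contigs: int           # number of contigs in the assembly
    total_length_mb: float   # cumulative contig length, in Mb
    gc_percent: float        # GC content, in %
    uncalled_cgmlst: int     # uncalled alleles out of the 545 core genes


def passes_qc(g: GenomeStats) -> bool:
    """Apply the five thresholds stated in the text.

    Range bounds (3.6-6 Mb, 35-48% GC) are assumed to be inclusive.
    """
    return (
        g.coverage > 30
        and g.n_contigs < 600
        and 3.6 <= g.total_length_mb <= 6.0
        and 35.0 <= g.gc_percent <= 48.0
        and g.uncalled_cgmlst < 100
    )


# Made-up assemblies, for illustration only.
well_assembled = GenomeStats(80.0, 149, 4.0, 41.8, 3)
too_fragmented = GenomeStats(80.0, 884, 4.0, 41.8, 135)

print(passes_qc(well_assembled))   # True
print(passes_qc(too_fragmented))   # False
```

An assembly failing any single criterion is rejected, which is why a highly fragmented genome is excluded even when its coverage, size and GC content are acceptable.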

Genomic analyses
Comparative analyses of the pangenome were performed using two software tools: Roary version 3.11.2 [16], and a combination of the COG and OMCL algorithms in GET_HOMOLOGUES version 20190411 [12]. Both methods yielded a similar number of gene clusters. In the Roary analysis, a 60% identity cut-off was applied to define gene clusters (option -i 60), and no other parameters were modified. Among the Roary outputs, a tab-separated file containing the number of genes in the pangenome was used to create a graph depicting the variation in the number of gene clusters as a function of the number of genomes analyzed. Roary iterated 10 times, calculating the number of new genes added as each genome was sequentially incorporated into the analysis. This graph facilitated a quick determination of whether the pangenome was open or closed and allowed for the calculation of the α coefficient in Heaps' law (n = κN^γ, with γ = 1 - α) [17]. GET_HOMOLOGUES, on the other hand, was used to infer the distribution of the pangenome into cloud, shell, soft-core and core compartments. This was achieved by generating a tab-separated pangenome matrix file that included all the clusters identified by both the COG and OMCL algorithms. The matrix represented the intersection of the two methods and served as input for the parse_pangenome_matrix.pl script within GET_HOMOLOGUES, which classified the clusters as cloud (shared by up to 2 genomes), shell (shared by more than 2 genomes but less than 93% of genomes analyzed), soft-core (shared by 93-99% of genomes), or core (shared by 100% of genomes). Due to the substantial number of genomes available for L. interrogans and L. borgpetersenii, as well as the redundancies observed in serogroups and serovars, representative genomes of each serogroup/serovar were selected to reduce computational costs. Excluding genomes with redundant identities is not expected to significantly alter the pangenome distribution.
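To make the two quantities in this paragraph concrete, the sketch below (i) assigns a gene cluster to a pangenome compartment using the thresholds as worded in the text and (ii) solves Heaps' law for α. It is illustrative only and does not reproduce the internals of parse_pangenome_matrix.pl or the non-linear fit performed on the Roary curve; the function names are hypothetical. The numeric example uses the values reported in the S3 Fig legend (n = 13,168, N = 64, κ = 2851.4).

```python
import math


def classify_cluster(n_with_cluster: int, n_genomes: int) -> str:
    """Compartment of a gene cluster, following the definitions in the text.

    core: 100% of genomes; soft-core: at least 93% but not all;
    cloud: at most 2 genomes; shell: everything in between.
    """
    if n_with_cluster == n_genomes:
        return "core"
    if n_with_cluster / n_genomes >= 0.93:
        return "soft-core"
    if n_with_cluster <= 2:
        return "cloud"
    return "shell"


def heaps_alpha(pangenome_size: float, n_genomes: int, kappa: float) -> float:
    """Solve n = kappa * N**gamma for gamma, then return alpha = 1 - gamma.

    Here kappa is taken as given; in the study it came from curve fitting.
    """
    gamma = math.log(pangenome_size / kappa) / math.log(n_genomes)
    return 1.0 - gamma


alpha = heaps_alpha(13168, 64, 2851.4)
print(round(alpha, 2))            # 0.63, i.e. alpha < 1 (open pangenome)
print(classify_cluster(60, 64))   # soft-core (60/64 = 93.75%)
```

With these inputs, γ is about 0.37, so α = 1 - γ is about 0.63, in agreement with the value quoted for L. santarosai.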
Genome size and GC content for highly virulent Leptospira species were determined through DFAST annotation [18]. Individual values were plotted and grouped per species, with the mean and standard deviation displayed. Genome size and GC content were compared using the Kruskal-Wallis rank sum test for the comparison of Leptospira spp., and the Wilcoxon rank test for the comparison of two phylogenetically related groups. Post-hoc comparisons were performed using Dunn's Kruskal-Wallis multiple comparisons test (Dunn, 1964). P-values were adjusted with the Bonferroni method. Statistical analyses were performed in R [19] using the FSA package [20].
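For readers unfamiliar with these tests, the sketch below shows how the Kruskal-Wallis H statistic and a Bonferroni adjustment are computed. It is a didactic pure-Python illustration, not the R/FSA code used in the study: the function names are hypothetical, the H statistic omits the tie correction, and no p-value is derived from H.

```python
def midranks(values):
    """1-based ranks; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (simplification: no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = midranks(pooled)
    n = len(pooled)
    weighted_sum, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted_sum += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * weighted_sum - 3 * (n + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy data: three fully separated groups of three values each.
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

In practice, an established statistics package should be preferred, since it also handles ties and reports the associated p-values.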
Average Nucleotide Identity (ANI) and Percentage of Conserved Proteins (POCP) were calculated for the 64 L. santarosai genome sequences as well as for L. interrogans str. Fiocruz L1-130 and L. borgpetersenii str. M84, used as outgroups (S1 and S2 Figs). Genomes were annotated with Prokka version 1.13.7 [21]. ANI and POCP matrices were inferred using the OMCL algorithm via GET_HOMOLOGUES version 20190411 [12]. Briefly, to calculate ANI, the option -A was employed along with option -a to utilize nucleotide sequences and perform BLASTN. This process generated a tab-separated file containing average percentage sequence identity values between pairs of genomes, calculated from sequences within all identified clusters (option -t 0). This tab-separated file served as the input to create a symmetric matrix, in which the genomes were clustered based on their ANI values. Dendrograms based on this clustering were generated on both sides of the matrix to visually represent the proximity among genomes. Similarly, POCP was calculated by including the option -P and performing default BLASTP searches. This step yielded another tab-separated file, which was subsequently used to create a symmetric matrix. Analogous to the ANI matrix, the genomes were clustered based on the percentage of conserved proteins shared between pairs of genomes. These values are calculated as POCP = (C_a + C_b)/(total_a + total_b), where C_a and C_b denote the number of conserved proteins from genome a in genome b and from genome b in genome a, respectively, and total_a and total_b are the total numbers of proteins in each genome. The clustering process also generated dendrograms, indicating the proximity among genomes in terms of conserved proteins.
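The POCP formula above can be illustrated with a short function. This is a sketch under stated assumptions rather than the GET_HOMOLOGUES implementation: the function name and the example counts are hypothetical, and the result is multiplied by 100 on the assumption that the value is reported as a percentage.

```python
def pocp(c_a: int, c_b: int, total_a: int, total_b: int) -> float:
    """POCP in percent: 100 * (C_a + C_b) / (total_a + total_b).

    c_a: proteins of genome a conserved in genome b
    c_b: proteins of genome b conserved in genome a
    total_a, total_b: total number of proteins in each genome
    """
    if total_a + total_b == 0:
        raise ValueError("both genomes have zero proteins")
    return 100.0 * (c_a + c_b) / (total_a + total_b)


# Made-up counts: 80 of 100 and 90 of 100 proteins conserved.
print(pocp(80, 90, 100, 100))  # 85.0
```

Because the numerator and denominator both sum over the two genomes, the value is identical whichever genome is labeled a, which is why the resulting matrix is symmetric.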
Core genome MLST (cgMLST) typing was performed using a scheme based on 545 core genes, as previously described [1]. The L. santarosai core-genome-based phylogeny was constructed using the alignment of 1288 core genes resulting from the Roary analysis (60% identity cut-off, option -i 60). The best-fit model and the maximum-likelihood phylogenetic tree were determined by IQ-TREE version 1.6.11 [22], considering 10,000 ultrafast bootstraps [23]. L. interrogans str. Fiocruz L1-130 and L. borgpetersenii str. M84 were used as outgroups. Tree branches were transformed with the "proportional" option in FigTree software v1.4.4 (http://tree.bio.ed.ac.uk/software/figtree/), which adjusts branch distances according to the number of tips under each node to improve visualization of the tree.
Gene presence/absence analyses among rfb clusters from the different genomes studied here were performed by protein-level searches using BLASTP [24] and subsequent network associations using NetworkX version 2.6.2 [25]. A similarity threshold of 60% was applied, as previously described [2]. The resulting presence/absence table obtained from the network association analysis was converted into a binary CSV file, where 0 represents gene absence and 1 represents gene presence. This binary table was subjected to hierarchical clustering based on shared protein-encoding genes (options: Euclidean distance, Ward linkage) using the tools available at //mev.tm4.org. Jaccard's similarity index was used to measure the similarity between rfb patterns.
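The Jaccard similarity between two rfb presence/absence patterns can be sketched as follows. This is a minimal standard-library illustration, not the pipeline used in the study; the function name and the example vectors are hypothetical, and returning 1.0 when both vectors contain no genes is a convention chosen here.

```python
def jaccard(a, b):
    """Jaccard similarity between two equal-length 0/1 vectors.

    Counts genes present in both strains divided by genes present in
    at least one of them.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    if either == 0:
        return 1.0  # assumption: two empty patterns are treated as identical
    return both / either


# Made-up patterns over four genes: two genes shared, three present overall.
strain_1 = [1, 1, 0, 1]
strain_2 = [1, 0, 0, 1]
similarity = jaccard(strain_1, strain_2)  # 2 shared / 3 present = 0.666...
```

Genes absent from both strains do not contribute to the index, which makes it suitable for sparse presence/absence tables in which most genes are missing from most strains.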

Distribution of pathogenic Leptospira species shows that L. santarosai isolates are mostly from the Americas
We first investigated the geographical distribution of pathogenic Leptospira species using 914 genomes of isolates collected between 1928 and 2022 (S1 Table).
Leptospirosis is endemic in most countries of South and Central America, as well as in the Caribbean region [9,26-28]. In addition, most outbreaks of leptospirosis have been reported in Latin America and the Caribbean region [29], where the disease is widespread in domestic and wild animals [12]. However, comprehensive data on human and animal leptospirosis remain scarce in most American countries [30]. We previously studied the genomes of L. noguchii isolated from humans and animals in America [2], but our knowledge of L. santarosai, the other prevalent species in America, is rather limited.
Here, we sequenced 18 L. santarosai strains, including twelve strains that were isolated from patients in Costa Rica in 2020-2021. The ANI and POCP values calculated for the 64 L. santarosai strains further confirmed that they all belong to the same species (S1 and S2 Figs). Of the 64 L. santarosai strains in our genome database, 28 were isolated in South America (Brazil, Colombia, Ecuador, Peru), 21 in Central America (Costa Rica, Panama), 6 in North America (US; not including Puerto Rico), 7 in the Caribbean region (Martinique, Guadeloupe, Trinidad and Tobago and Puerto Rico), and only two strains were isolated outside the Americas (China and Democratic Republic of the Congo) (Fig 1 and S2 Table). Of note, L. santarosai has not been isolated in Uruguay, where a large number of Leptospira strains have been isolated from cattle [15].
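As a quick arithmetic check of the regional breakdown, the counts given above can be tallied as follows. The dictionary keys are informal labels chosen for this sketch; the counts are those stated in the text.

```python
# Number of L. santarosai genomes per region, as stated in the text.
counts = {
    "South America": 28,
    "Central America": 21,
    "North America": 6,
    "Caribbean": 7,
    "Outside the Americas": 2,
}

total = sum(counts.values())                        # 64 genomes
americas = total - counts["Outside the Americas"]   # 62 genomes
share_americas = 100.0 * americas / total           # 96.875 %

print(total, americas, share_americas)
```

The tally recovers the 64 genomes analyzed and shows that 62 of them (about 96.9%) originate from the Americas.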
Previous studies have shown that L. santarosai can be detected from different sources in many countries of America and the Caribbean region. It is the predominant species in humans, rodents and dogs in Peru and Colombia [31,32]. In Peru, it has additionally been found in rural environmental water samples (but not in urban samples), as well as in association with pigs and cattle [33]. In Brazil, L. santarosai has been isolated from dogs [34], cattle [35], goats [36], and capybaras [37]. Moreover, it has also been identified in patients in French Guiana [38], Guadeloupe [39] and the US [40].
Only a few reports have described the existence of L. santarosai outside the American continent. Some years after the original description of L. santarosai [11], Brenner et al. listed 65 L. santarosai strains, of which only three were isolated outside America [41]. One L. santarosai strain was isolated from a patient in Sri Lanka in 1966, but the species has never been reported in this country since [42,43]. The other two strains were isolated in Denmark and Indonesia but, again, L. santarosai has not been subsequently isolated in these countries. More recently, a strain belonging to L. santarosai serogroup Grippotyphosa was isolated from a patient in India and its genome was sequenced [44]. However, because of its highly fragmented genome (884 contigs) and missing genomic data (135 uncalled cgMLST alleles), cluster assignment was not possible for this isolate and we removed its genome from our analysis.
Serogroup Shermani is commonly reported in serological surveys of animals in Asia [45-48]. Unfortunately, there is no evidence as to whether the infecting strains described in these studies were L. santarosai or another species, such as L. noguchii or L. inadai, which also contain serovars from serogroup Shermani [49]. Finally, the other country outside the Americas where L. santarosai has been reported is Taiwan, in East Asia. Serogroup Shermani, presumably belonging to L. santarosai, is predominant among patients with severe leptospirosis in Taiwan [50]. However, only one L. santarosai strain, strain CCF, has been isolated from a patient with leptospirosis in Taiwan [51], and this strain, for which we do not have the complete genome [52], is no longer available (personal communication of Prof Chih-Wei Yang).

Genome analysis shows species-specific features and an open pangenome for most pathogenic Leptospira species
Phylogenetic analysis of the eight pathogenic Leptospira species, using the saprophyte L. biflexa as the outgroup, shows two distinct groups, as previously shown [8,53]. One phylogenetic group (group I) is constituted by L. santarosai, L. mayottensis, L. borgpetersenii, L. alexanderi and L. weilii, and the other (group II) by L. interrogans, L. kirschneri and L. noguchii. Interestingly, with the exclusion of L. weilii, the genome sizes were not significantly different between species of group I (Dunn's test, all pairwise comparisons with adjusted p >0.05), supporting their characterization as a monophyletic group. However, the G+C% content did differ significantly between Leptospira spp. within group I (Dunn's test, all pairwise comparisons with adjusted p ≤0.04). Variation was larger within the species of group II, as L. interrogans shows a larger genome size than L. kirschneri (Dunn's test, adjusted p ≤0.0000003), whereas G+C% was significantly different between L. interrogans and L. kirschneri, and between L. interrogans and L. noguchii (Dunn's test, all pairwise comparisons with adjusted p ≤0.0002). These disparities in GC content and genome size may reflect long-term niche adaptation of pathogens, which emerged a hundred million years ago, at the same time as the appearance of mammals [54].
Most pathogenic species exhibit a strong enrichment (>3X) of accessory genes (Fig 3). L. mayottensis, however, stands as an exception: the analysis of 33 strains showed 1,949 and 2,284 genes constituting the core and the accessory genomes, respectively (1.2X enrichment). This probably reflects the restriction of L. mayottensis to the Indian Ocean (Mayotte and Madagascar) and its specific adaptation to tenrecs, described as the main reservoir of this pathogenic species [55,56].
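The 1.2X figure quoted for L. mayottensis can be reproduced from the gene counts given above, assuming that "enrichment" here denotes the ratio of accessory to core genes. That reading is an interpretation of the text, and the function name below is hypothetical.

```python
def accessory_to_core_ratio(core_genes: int, accessory_genes: int) -> float:
    """Ratio of accessory to core genes (assumed meaning of 'enrichment')."""
    if core_genes <= 0:
        raise ValueError("core_genes must be positive")
    return accessory_genes / core_genes


# Counts reported in the text for the 33 L. mayottensis strains.
ratio = accessory_to_core_ratio(1949, 2284)
print(round(ratio, 1))  # 1.2
```

Under the same reading, the >3X enrichment reported for the other species corresponds to accessory genomes more than three times larger than the respective core genomes.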

High genetic diversity of L. santarosai strains
To further investigate the genetic diversity of L. santarosai isolates, we used a core genome MLST (cgMLST) scheme [1] (Fig 4). The L. santarosai strains (n = 64) were divided into 55 cgMLST clonal groups (cgCGs), showing a high intraspecies genetic diversity (Fig 4), as shown in previous studies [29,32,39,57,58]. Among the 55 cgCGs, none is composed of more than 3 strains (S1 Table), and none is composed of both human and animal strains. We therefore cannot identify transmission of L. santarosai clones between different hosts. There is a wide range of possible reservoirs for L. santarosai in the Americas. Some countries in the region are among the largest cattle producers in the world, so these animals could be important reservoirs for human infections. The Americas also exhibit great biodiversity, so many species of wild animals, such as rodents and marsupials, and domestic animals, such as dogs, may be involved in transmission cycles. Among the 15 strains isolated from patients in Costa Rica, only two (id1256 and id1260) belong to the same clonal group, further confirming the great diversity of strains even within one small country.
Unfortunately, analyses to identify associations of Leptospira genotypes with particular epidemiological variables (host reservoir, disease outcome, etc.) cannot be performed with our small sample size. However, we could determine some phylogeographic lineages. A clear geographical separation of the clonal groups was observed for strains from (i) Central America, comprising isolates from Costa Rica (15; all human strains) and Panama (6), but also one strain from South America (Colombia); South America, which was further divided into two divergent groups, (ii) one containing strains from Brazil (10; mostly bovine strains) and (iii) the other including strains from Peru (9); and (iv) the Caribbean islands, with strains from Guadeloupe (2), Martinique (2), Trinidad (1) and Puerto Rico (1) (Fig 4). This suggests an ancestral presence of this species in these different countries, followed by separate evolution with no or low geographic diffusion. In contrast, previous phylogenetic analyses of L. noguchii [2] and L. interrogans [1] did not reveal a correlation of genotype with geographical distribution.
Our analysis of the gene composition of the rfb cluster is consistent with previous studies showing that L. santarosai only contains serovars from serogroups Shermani, Hebdomadis, Tarassovi, Pyrogenes, Autumnalis, Bataviae, Mini, Grippotyphosa, Sejroe, Pomona, Javanica, and Sarmin [49], as well as probable new serovars. However, it must be noted that most of the L. santarosai genomic assemblies analyzed here are fragmented (average contig number, 149), which could lead to inaccurate serotype assignments. Further studies should include less fragmented genomes, ideally closed genomes, for a correct interpretation of the rfb patterns and to better identify the genes determining the serovar identity. As previously shown [1], most serogroups had a polyphyletic distribution. Thus, isolates from the serogroups Tarassovi and Grippotyphosa did not all cluster together in the phylogenetic tree based on either cgMLST alleles (Fig 4

Conclusion
In conclusion, genome analyses showed species-specific genome size and GC content and an open pangenome in pathogenic species, with the exception of L. mayottensis. Taken together, these analyses suggest an ancient speciation of pathogens and their adaptation to diverse niches, resulting in great genotypic and phenotypic diversity across species. We also showed that, despite the limited geographic distribution of L. santarosai to America, this species exhibits great diversity and an open pangenome. This study represents the largest and most detailed analysis of the genetic and serotype diversity of this pathogen to date. Our collection of L. santarosai exhibits an overrepresentation of isolates from America, and more genomes representing undersampled regions and different animal reservoirs will be necessary to better understand the evolutionary history, epidemiology, and population dynamics of L. santarosai. We discovered a large genetic diversity among isolates from both human and animal samples, with no apparent transmission from one host to another, although circulation of strains that share the same serogroup was evident in multiple hosts. Outbreak investigations performed at the local level would likely improve the identification of animal reservoirs. These results will improve our understanding of the dissemination of genotypes in specific geographic regions and update the knowledge of strains circulating in America for effective disease surveillance.
To perform the Jaccard similarity index calculation, each strain (column) was converted into a vector of 0 (gene absence) and 1 (gene presence), and pairwise comparisons between the vectors were then performed using a Python pipeline via NumPy. Jaccard similarity is represented in colors, according to the scale shown in the lower right insert. (PDF)
S5 Fig. Core-genome based phylogeny of L. santarosai. Phylogenetic tree based on the sequences of 1288 core genes of L. santarosai. A core-gene alignment was obtained by Roary (60% identity cut-off), and then used to perform the phylogeny. The best-fit model and the maximum-likelihood phylogenetic tree were determined by IQ-TREE version 1.6.11 [5], considering 10,000 ultrafast bootstraps [6]. L. interrogans strain Fiocruz L1-130 and L. borgpetersenii strain M84 were included as outgroups. The serogroup of each strain is indicated in parentheses, as well as the world region (AF = Africa; CA = Central America; CB = Caribbean; NA = North America; SA = South America; SAs = South Asia). Bootstrap values other than 100% are shown. (PDF)
Investigación Agropecuaria, Institut Pasteur de Montevideo), Dr Vasantha Kumari Neela (Faculty of Medicine and Health Sciences, Universiti Putra Malaysia, Malaysia), Dr Mohammad Ridhuan Mohd Ali (Institute for Medical Research, Malaysia), Dr Shlomo Blum (Kimron Veterinary Institute, Israel), Dr Nikolai Tokarevich (Laboratory of Zoonoses, Pasteur Institute in Saint Petersburg, Saint Petersburg, Russia), Dr Angelica Delgado (National Reference Laboratory of Bacterial Zoonoses, Lima, Peru), Dr Marga Goris (National Collaborating Centre for Reference and Research on Leptospirosis, Academic Medical Center, Amsterdam, the Netherlands) and Prof Chih-Wei Yang (Chang Gung University, Taiwan) for sharing strains and/or information on some of the strains used in our study.

Fig 1. Geographic origins of the most frequent pathogenic Leptospira species in our genome database (n = 914). Each pie chart corresponds to a given world region. As shown in our map, L. santarosai (n = 64) is mostly found in America (North America, Central America, South America and the Caribbean islands). The base layer of the map is freely available from outline-world-map.com. //doi.org/10.1371/journal.pntd.0011733.g001

Fig 3. Pangenome distribution in four categories (cloud, shell, soft core and core) for Leptospira pathogenic species. Analyses done with GET_HOMOLOGUES showing the U-shaped distribution of the pangenome from L. santarosai, L. mayottensis, L. borgpetersenii, L. weilii, L. interrogans, L. kirschneri and L. noguchii. For L. interrogans and L. borgpetersenii, a subset of representative genomes of all sizes and from all geographic locations was selected to reduce computational costs and avoid redundancy. //doi.org/10.1371/journal.pntd.0011733.g003

Fig 4. Phylogenetic tree of L. santarosai strains. Maximum-likelihood phylogeny based on the variable sites of the cgMLST scheme consisting of 545 core genes, showing the distribution of species, serogroups and geographic origins. The L. santarosai strains (n = 64) were divided into 55 cgMLST clonal groups belonging to 9 different serogroups; for 20 strains the serogroup is unknown or undetermined. Colors indicate strains isolated from the same geographic region (Central America in red, South America in blue, North America in green, Caribbean in orange, Southern Asia in purple, and Middle Africa in black). Branch lengths were not used to ease readability of groups and isolates. //doi.org/10.1371/journal.pntd.0011733.g004

Fig 5. Gene presence/absence matrix of rfb clusters from different Leptospira strains and species, covering a range of distinct serogroup/serovar identities compared to L. santarosai strains. Horizontal lines correspond to individual genes or sets of genes grouped according to their percentage of similarity (cut-off 60%), green meaning presence and black absence. Scales on the left of the matrices indicate the number of different genes being compared. Columns correspond to different Leptospira strains, as indicated on the column labels. Those whose name begins with the species correspond to the rfb cluster of strains of known serovar/serogroup (serovar indicated in red, serogroup in brackets and in bold). Names of the L. santarosai strains analyzed in the present study begin with their ID number followed by the name of the strain. In brackets, the serovar is indicated in red, if assigned, and the serogroup in bold. For the latter strains, the whole genome was used and compared against the other rfb clusters and a pan-rfb reference. Strains were organized after hierarchical clustering considering presence/absence of rfb genes. Four major clusters can be distinguished: 1) including serogroups Australis, Grippotyphosa, Cynopteri, and Autumnalis; 2) the most variable, comprising serogroups Tarassovi, Pomona, Javanica, Pyrogenes, Celledoni, Icterohaemorrhagiae, Canicola and Sarmin; 3) composed only of serogroups Tarassovi and Shermani; and 4) including serogroups Sejroe, Hebdomadis and Mini. //doi.org/10.1371/journal.pntd.0011733.g005

S2 Fig. Percentage of Conserved Proteins (POCP) among L. santarosai strains. L. interrogans strain Fiocruz L1-130 and L. borgpetersenii strain M84 were included as references of distinct species. POCP percentages are represented as colors of the square matrix elements, according to the scale shown in the upper left insert. The names of the 64 L. santarosai strains are indicated on the right of the matrix. Clustering, shown on the left side (and upper side, symmetric) of the matrix, was performed by GET_HOMOLOGUES version 20190411. (PDF)
S3 Fig. Pangenome analysis of L. santarosai strains. The graph was obtained with Roary 3.11.2 (yielding a total of 13,168 gene clusters). The pangenome of L. santarosai presents an open profile, which was further verified by Heaps' law [3], n = κN^γ; considering a total of 13,168 genes (n) in the pangenome (according to Roary) and the 64 genomes (N) included, the observed curve allows for non-linear fitting with a constant κ = 2851.4. Thus, 1 - γ = α = 0.63. An α value <1 indicates an open profile. The same result was obtained considering n = 13,156 (GET_HOMOLOGUES). (PDF)
S4 Fig. Jaccard similarity matrix for the comparative analyses of the rfb clusters. The Jaccard similarity matrix is organized according to the gene presence/absence matrix of rfb clusters shown in Fig 5.