Genomes of Leishmania parasites directly sequenced from patients with visceral leishmaniasis in the Indian subcontinent

Whole genome sequencing (WGS) is increasingly used for molecular diagnosis and epidemiology of infectious diseases. Current Leishmania genomic studies rely on DNA extracted from cultured parasites, which might introduce sampling and biological biases into the subsequent analyses. Up to now, direct analysis of Leishmania genome in clinical samples is hampered by high levels of human DNA and large variation in parasite load in clinical samples. Here, we present a method, based on target enrichment of Leishmania donovani DNA with Agilent SureSelect technology, that allows the analysis of Leishmania genomes directly in clinical samples. We validated our protocol with a set of artificially mixed samples, followed by the analysis of 63 clinical samples (bone marrow or spleen aspirates) from visceral leishmaniasis patients in Nepal. We were able to identify genotypes using a set of diagnostic SNPs in almost all of these samples (97%) and access comprehensive genome-wide information in most (83%). This allowed us to perform phylogenomic analysis, assess chromosome copy number and identify large copy number variants (CNVs). Pairwise comparisons between the parasite genomes in clinical samples and derived in vitro cultured promastigotes showed a lower aneuploidy in amastigotes as well as genomic differences, suggesting polyclonal infections in patients. Altogether our results underline the need for sequencing parasite genomes directly in the host samples

Introduction Leishmania (Kinetoplastida, Trypanosomatidae) is a genus of parasitic protozoa transmitted by sand flies to a variety of mammal hosts including humans. Within mammals, the amastigote-stage parasites are strictly intracellular and infect a range of professional phagocytes. Leishmania cause a broad spectrum of clinical presentations, the most severe being visceral leishmaniasis (VL) which annually affects up to 90,000 new individuals [1] and has a 10% mortality rate [2]. Ongoing control programs, like the one running in the Indian subcontinent (ISC) [3] could be challenged by several factors, including drug resistance [4,5] and emergence of new parasite variants [6,7]. More than ever, close surveillance is required to avoid new epidemics, including molecular tracking of the etiological agent.
Whole genome sequencing (WGS) is increasingly used for highly discriminatory parasite molecular tracking. Using that method, we previously resolved the hitherto cryptic population structure of Leishmania donovani, the etiological agent of visceral leishmaniasis (VL), in the Indian subcontinent ISC [8]. We identified a core group (CG)-associated with the successive VL epidemics in the lowlands-itself structured into (i) six congruent monophyletic groups (ISC2-7) together with (ii) three other groups (ISC8-10) and several ungrouped isolates, with a less certain evolutionary history. We found in the Nepalese highlands an emerging population (ISC1), quite different from the CG [6,9]. We described an unprecedented level of aneuploidy and found genetic variants likely to underpin reduced efficacy of antimonial drugs in that region.
However, such studies suffer a major limitation: they currently rely on sequencing parasites isolated from patients and grown axenically in vitro, with two possible biases. First, there could be a major representation bias, because of the variable isolation success and the high frequency (up to 90%) of asymptomatic cases [10], currently unsampled. Second, we have shown experimentally major differences in the genome of L. donovani during the life cycle: in particular, aneuploidy was much lower and affected other chromosomes in the intra-cellular life stages (amastigotes) of infected hamsters, in comparison to cultivated promastigotes [11]. This genome plasticity may affect gene dosage and therefore have a biological impact via transcriptome and proteome changes [12]. If the phenomenon also occurs between amastigotes from human subjects and in vitro cultivated promastigotes, it could have a major impact on the search for molecular bio-markers for clinically important traits such as drug resistance or virulence, as relevant variants present in the host could be diluted or disappear in culture.
Applying parasite WGS directly in host samples thus appears critical for next generation genetic analyses of natural populations of Leishmania. Such applications are challenged by the low amounts of Leishmania DNA in biological samples and the high abundance of DNA in the nucleated host cells. Accordingly, methods for enrichment of parasite DNA are required before undertaking WGS. A series of approaches have been used for other pathogens, including (i) removal of methylated host DNA [13], (ii) selective whole genome amplification [14] or (iii) targeted genome enrichment [15]. Here, we explored the last solution. Using Agilent Sure-Select technology, we designed an array for application in the L. donovani species complex. In this method an Illumina-based NGS sequencing library is prepared from the mixed genomic DNA sample, which is later hybridized with biotinylated RNA baits complementary to the Leishmania genome and the fraction of the input DNA that anneals to the baits is then sequenced (further called SuSL-seq). We validated and optimized this genome capture procedure with artificial mixtures of Leishmania and host DNA to simulate the proportions expected in clinical samples. Finally, we applied the method on a set of 63 bone marrow and splenic aspirates from Nepalese VL patients. In 12 of them, we compared the karyotype and genome sequence with that of isolates derived from the same tissue samples. We find genomic differences between the parasites analyzed directly in clinical samples and the derived isolates, further supporting the need for direct analyses in host samples.

Ethics statement
The ethics committee of the Nepal Health Research Council, Kathmandu (IRC/637/014) and the corresponding bodies at the Institute of Tropical Medicine of Antwerp and the Antwerp University, Belgium (B300201627694), reviewed and approved the study protocol and the use of already-existing samples. Informed written consent was obtained from each patient or his/ her guardian for those <18 years of age. All the patients and caretakers/parents had the study purpose explained to them in local language. A total of 63 anonymized samples (59 bone marrows, 4 spleen aspirates from clinically confirmed VL patients in Terai, Nepal) were collected at the B.

DNA extraction and next generation sequencing
DNA from parasite cultures and clinical samples (bone marrow and spleen aspirates) was extracted using QiaAmp DNA blood mini kit (Qiagen, Venlo, the Netherlands) following the manufacturer's instructions, and DNA concentrations were verified using the Qubit broadrange DNA quantification kit (Thermo Fisher Scientific, Waltham, USA). Sequencing was done following the manufacturer's standard cluster generation and sequencing protocols.
Standard Whole Genome Sequencing was applied to isolates. Genomic DNA was sheared into 400bp fragments by focused ultrasonication (Covaris Inc., Woburn, USA). Amplificationfree Illumina libraries were prepared [16] and 125bp PE reads were generated on the Illumina HiSeq v4.
Deep sequencing was applied to quantify parasite load in batch 1 of Bone Marrow (BM) samples. Genomic DNA was sheared into 150bp fragments by focused ultrasonication (Covaris Inc., Woburn, USA) and libraries were prepared using the SureSelect Automated Library Prep Kit (Agilent, Santa Clara, USA). Index tagged samples were amplified using KAPA HiFi DNA polymerase, (KAPA Biosystems, Wilmington, USA) and quantified using an Accuclear DNA Kit (Biotium, Fremont, USA), 75bp PE reads were generated on the Illumina HiSeq 2500.
SureSelect was used for Leishmania genome capture. Genomic DNA was sheared into 150bp fragments by focused ultrasonication (Covaris Inc., Woburn, USA) and libraries were prepared, either as described above, or using the NEBNext Ultra II DNA Library prep Kit (New England Biolabs, Ipswich, USA). Samples were tagged with a unique index, amplified and quantified by qPCR (S1 Text). Samples were then pooled into batches for SureSelect enrichment, based on qPCR estimates of parasite load, so that each batch had as little variation in parasite load as possible: we aimed for only a 2-fold variation within a batch, while optimising the use of the sequencing data. Based on our experience of enrichment levels obtained in the experimental samples, batch sizes varied from 6 or 7 samples per batch where they had less than 0.05% Leishmania gDNA content up to 8 samples for higher parasite load. Samples were pooled within batches in an equimolar fashion. Pooled material was taken forward for hybridization, capture and enrichment using the standard SureSelect Target Enrichment System XT2 protocol recommended by the manufacturer (Agilent Technologies https://www.agilent.com/cs/library/ usermanuals/Public/G9630-90000.pdf). Custom designed oligonucleotide baits were used at a 1:10 stock dilution. 75bp paired-end reads were generated on the Illumina HiSeq 2500.
Amplicon sequencing was used to explore the presence of minor genetic variants in clinical samples. In the first step we used the primers listed in S1 Text to amplify fragments that contained diagnostic SNPs/insertion for ISC3, ISC4, ISC5 and ISC5 groups using Advantage 2 Polymerase Mix (TAKARA, Kusatsu, Japan). 2μl of template with varying DNA concentration were used in a total of 50μl PCR reaction (S1A Dataset). Touch down PCR with the following conditions was used: 95˚C for 5 mins followed by 16 cycles of 95˚C for 30s, 68˚C to 60˚C (decrease of 0.5˚C with each cycle) for 30s, 68˚C for 30sec, followed by 19 cycles of 95˚C for 30s, 60˚C for 30s, 68˚C for 30s and a final step at 68˚C for 10min. Then, a second round amplification was performed, using 5μl of template, to add indexed Illumina adapters [17]. PCR conditions of the second round were 95˚C for 3 mins followed by 10 rounds of 98˚C for 20s, 65˚C for 30s, 72˚C for 30s finishing with 5 mins at 72˚C. The final products were purified using AMPure XP SPRI beads and quantified on a plate reader using an AccuClear DNA quantification kit (Biotium, Fremont, USA) followed by equimolar pooling. Due to low base diversity of the amplicons, 20% PhiX was spiked into the resulting pool and 300bp PE reads were generated on the Illumina MiSeq.

DNA read mapping, SNP and indel calling
SuSL-seq reads from clinical samples and data from 204 (54 strains, 150 isolates) previously sequenced L. donovani genomes [8] were mapped to the improved reference L. donovani genome LdBPKv2 [11], using Smalt v7.4 (http://www.sanger.ac.uk/science/tools/smalt-0). The methods were reported elsewhere [11]. Briefly, Smalt options for exhaustive searching for optimal alignments (-x) and random mapping of multiple hit reads were used. Reads were mapped when mapping base identity was greater than or equal to 80% to the reference genome (-y 0.8), to eliminate spuriously mapped reads mainly from human DNA. For the sequence analyses, all reads mapped to the reference, not limited to bait locations, were analyzed. For phylogenetic analyses and for evaluating the accuracy of SuSL-seq data, we used the population SNP and indel calling mode of UnifiedGenotyper in Genome Analysis Toolkit v3.4 with the SNP cut off 4000 used (GATK: https://software.broadinstitute.org/gatk/). No lower or higher depth cut off was used in the population SNP calling since the depth of some SuSL-seq samples was quite low.

Reference genome masking and SNP screening
To mask repetitive regions and regions prone to generate false positive SNPs due to the presence of non-Leishmania DNA in clinical samples, we masked 7,453 regions that spanned a total of 43,200 bases. These regions were identified by blast similarity hits [8] and by using SNPs that were generated by pure human DNA SureSelect analysis and samples with lower amount of Leishmania DNA. The masked positions were provided in MaskSuSL4.bed file. We also excluded a SNP cluster containing more than 7 SNPs within 60 neighboring bases to exclude false SNPs from simulated Leishmania DNA samples mixed with mouse DNA Mpd1 and Mpd2. This was effective to eliminate non-Leishmania reads with a dozen of SNPs but with highest mapping score 60.

Somy estimation
For somy estimation, 2 different methods were used. For samples with consistently high-quality mapping against Leishmania, such as genomic libraries from cultured promastigotes we used the median depth across each chromosome [11,18]. For samples whose read coverage depth was more variable, such as amplified and unamplified clinical samples, depth was calculated in 1000bp windows to allow a more uniform analysis of coverage variation [11]. Previous studies have shown some correlation between read coverage depth and chromosome length for some sequencing runs [11]. To correct this bias, we applied local median correction [11]. This method calculates somy based on the median coverage values of chromosomes with similar lengths. In the calculation of somy, the first 7,000 bases and last 2,000 bases of each chromosome were excluded because these regions tend to be repetitive telomeric regions and their depth was not reliable. Similarly, certain over-amplified chromosome regions were excluded from depth calculation for SuSL-seq samples: these regions were listed in S1B Dataset. The range of monosomy, disomy, trisomy, tetrasomy, and pentasomy was defined to be the full cell-normalized chromosome depth or S-value: S < 1.5, 1.5 � S < 2.5, 2.5 � S < 3.5, 3.5 � S < 4.5, and 4.5 � S < 5.5, respectively [11].

Gene copy number variation
Local copy number variants were detected as described elsewhere [8]. In brief, we calculated an average normalised haploid depth of H-locus (Ld23: 90426-104470) and M-locus (Ld36: 2558460-2576450) for each strain and then obtained an average normalised haploid depth and standard deviation for each group. For ISC1-specific CNVs, we calculated an average normalised haploid depth for each gene for each ISC1 strain and then obtained an average normalised haploid depth and standard deviation for each CG group, using the depth of 191 sequenced strains (promastigotes). ISC1 specific gene CNVs were then defined to be genes whose normalized haploid depth differed more than ± 0.3 from the baseline average depth of other ISC groups. The value 0.3 was chosen here because we have observed many long CNVs in trisomic chromosomes, which were amplified or deleted in 1 of the 3 homologous chromosomes.

Phylogenetic analysis
For phylogenetic analysis, 197 variables sites represented by 394 bases were used to perform neighbor-joining analysis using the p-distance model with 1000 bootstrap replicates as performed in MEGA 7.0.21 [19] and Splitstree v4 [20]. These analyses used 191 promastigote strains plus 51 clinical samples (one clinical sample from ISC1 was excluded because phylogenetically too divergent). Artificial mixtures of Leishmania and human DNA (Leishmania DNA percentage of 0.06%) were used to measure the accuracy to multiple infections of the phylogenetic analyses.

Identifying minor ISC genotypes and polyclonal infections with SuSL-seq reads
Two methods were used to identify minor ISC genotypes, respectively based on (i) SNP motifs specific of ISC groups and (ii) alternative allele frequency from sequenced reads.
We first extracted genotype specific 23-mer motifs that contained a SNP or reference base at the center. For ISC3, ISC4, ISC5, ISC6 and ISC7 groups, we created 26, 53, 38, 26 and 14 motifs that matched SNPs common to these groups, respectively. These group specific SNPs were identified using 191 samples analyzed in the previous study [8] and mapped to the pacbio reference [11]. To identify minor genotype reads which may have been missed by the standard SNP calling, we counted the reads in the alignment files that contained genotype specific motifs. These motifs were listed in S1A Dataset and used to estimate the proportion of ISC3, ISC4, ISC5, ISC6 and ISC7 in each sample.
In the second approach, a GATK population VCF file for bone marrow and promastigotes were converted to a table that contained alternative allele frequency. In addition, total read depth, reference read depth, alternative allele read depth and genotype information created for VCF file were kept along alternative allele frequency. We checked alternative allele frequencies at group specific SNP sites to identify polyclonal samples. A mean of alternative allele frequency was calculated by averaging alternative allele frequencies at all ISC specific SNP sites per sampling, excluding missing depth sites.

Identifying minor ISC groups using Amplicon sequencing
We applied a read count method and an exact motif matching method to identify minor ISC groups using amplicon sequencing. The read count method was based on samtools pileup and included reads marked as PCR duplicates, counting all four bases to access back ground base error rates. The exact motif matching method was based on a 17 mer for each diagnostic SNP and a 22 mer for the diagnostic insertion and they were listed in S1C & S1D Dataset.

Parasite cultures and competition experiment
Twelve clinical isolates paired to bone marrow samples were cultivated on HOMEM. Three L. donovani cloned strains were used in this experiment BPK067/0/cl2, BPK077/0/cl5, BPK091/ 0/cl9, respectively belonging to the ISC3, ISC6, and ISC4 genetic groups [8]. The parasites were grown on Tobie's blood agar [21] with a saline overlay at 26˚C to mimic the growth conditions during parasite isolation from clinical samples [22]. The parasites were inoculated to the concentration of 5x10 5 ml -1 , and passaged to fresh medium every 3-4 days. For each strain mixture, 3 replicates were prepared, and the pure strains were grown in duplicates. The proportion between two parasites strains was roughly verified at the start of the experiment, and was further assessed approximately every 5 passages.
To follow the proportions between two different parasite strains in culture, diagnostic SNPs, which allow distinction among different ISC genetic groups were used [23]. Specifically, the Ld36_p5263_F1 � and Ld36_p5263_R1, and Ld33_p3644_F1 and Ld33_p3644_R1 � primer pairs were used to distinguish between ISC3 and reference, and ISC6 and reference genotypes, respectively. The presence of these alleles was verified by standard Sanger sequencing, where primers mentioned above with � were used to sequence the amplified products. PCR amplicons were sequenced at Baseclear (Leiden, The Netherlands) with ABI 3730 (XL) DNA Analyzer, and the chromatograms were analyzed using the 4peaks software (Nucleobytes). The ability to detect both alleles was first verified by sequencing the PCR products amplified from DNA mixed in the same proportions as the parasites in the competition experiment. With this method, it was not possible to detect the exact proportions of the two alleles, but the relations between them (i.e. which allele was present in higher proportion) was assessed.

SuSL-seq analysis of clinical samples: sequencing statistics
Details on the SureSelect bait design are described in S1 Text and in S1A Fig. SuSL-seq was optimized on artificial mixtures (Leishmania DNA diluted in mammalian DNA in different ratios: see details and metrics in S1 Text and S1B and S1C Fig and S2 Fig): a percentage of Leishmania DNA of 0.006% was found to be the lowest limit for suitable analysis of genome diversity. SuSL-seq was then applied to 63 clinical samples with Leishmania DNA percentage of at least 0.006%, 59 bone marrow and 4 splenic aspirate samples collected from VL patients in Nepal in the same region where we previously studied genome diversity of L. donovani isolates (S3 Fig).
Among all SuSL-seq samples, the median depth of reads mapped against the L. donovani reference genome was 22.5, the 5 th and 95 th percentiles of median depth were 2 and 74.16, respectively, while the maximum depth was 126.5 (in BPK89BM) and the minimum average depth was 0.074 (in BPK161BM). Across all clinical SureSelect libraries, the mean percentage of reference genome bases covered by at least 5 reads was 61.1% and, the mean percentage of bases covered by at least 1 read was 78.4%. This coverage average was skewed by a small number of lower-quality samples. For instance, within a sequencing batch of 24 samples, over 90% of the genome was covered by at least one read in 14 samples; in all cases better genome representation was obtained following SureSelect enrichment (S4A Fig). Overall, as for the artificial mixtures, we obtained the highest enrichment for samples with lowest parasite loads. An approximately linear relationship was found between Leishmania DNA percentage in clinical samples and the proportion of reads mapping to Leishmania after SuSL-seq for samples with up to 1% of Leishmania DNA (Fig 1A). For samples with higher Leishmania content, no further increase was observed beyond around 60-70% of reads aligning to BPK282v2 reference genome ( Fig 1A). Eleven samples had low coverage, with less than 10% of the genome covered at a depth of 5x ( Fig 1B). These low-coverage samples also had the lowest parasite load, and were excluded from the analyses described below, so 52 samples were studied in greater detail.

SuSL-seq analysis of clinical samples: genetic variants
Using an improved version of the BPK282 reference genome BPK282v2 [11], we were able to call SNPs in these 52 samples, of which 51 were identified as belonging to CG and one belonging to the genetically distinct ISC1 sub-population [8]. We identified 197 variable sites that had sequence coverage and could therefore be reliably genotyped in all 51 CG clinical samples; these sites were also characterized in 191 previously sequenced parasite isolates from the CG [8] and 9 newly sequenced isolates (see below). The 197 sites were sufficient to reconstruct a phylogeny consistent with our previously published analysis [8], placing all clinical SureSelect samples within the previously identified ISC diversity for L. donovani (dots in Fig 2A). Out of the 51 CG samples, 40 clustered in one of the previously defined groups (ISC3-6, and ISC9). Five samples clustered at the base of ISC4. Six clinical samples clustered with isolates previously designated as 'ungrouped' (Fig 2A and S5A and S5B Fig), forming a novel monophyletic group that we have named ISC11. We verified the accuracy of our experimental and bio-informatic pipeline by including in the phylogenomic analyses SuSL-seq data from artificial mixtures (DNA of BPK282 diluted in human DNA): as expected, they clustered in the ISC6 group with the original BPK282 sequence data (S6 Fig). Notably, among the 11 samples that were excluded from the phylogenomic analysis because of poor genome-wide coverage, nine could be attributed to a specific ISC group, on the basis of diagnostic SNPs specific to these groups. The remaining two samples, BPK211BM and BPK161BM, had sequence data with too low coverage for any genotyping (S1E Dataset). Analysis of genotype distribution per village revealed sympatric genotypic diversity: for instance, in Itahari, four different ISC types were sampled over 11 days in June 2002 and three were sampled over four months in 2003 (S1F Dataset, available on request to the authors). Interestingly, in this village, two cohabiting members of the same family were sampled during each of these two periods: in both cases, two different ISC types were identified. We also searched for small indels that were unique to clinical samples. We identified here six unique indels, one of these in the LdAQP1 gene, and the 5 others in non-coding regions elsewhere in the genome. In a previous study, a fixed 2-nt homozygous indel within the coding region of the same LdAQP1 gene was found to be specific of the ISC5 group [8]. The LdAQP1 indel encountered in current study is different and introduced a1-bp frameshift within the LdAQP1 gene (S1G Dataset).
Next, we examined chromosome copy numbers. Consistent with our observations in artificially mixed samples (S1 Text), the apparent accuracy of our somy estimates in clinical samples depended on the Leishmania DNA percentages, as those with low percentage had lower normalized read depth and genome coverage (S1E Dataset). For example, chromosome 31 is known to be tetrasomic in all Leishmania species and isolates sequenced so far, except under experimental drug resistance selection where somy can be even higher [9]. As expected, S-values (normalized coverage depth, see Methods) of chromosome 31 were close to 4 in samples with high genome coverage. These values, however, were lower in samples with lower genome coverage, even though the somy of chromosome 31 was distinctively higher than 2 except for the very lowest coverage samples. To distinguish real somy variation and technical artefacts, we classified samples into group of similar genome coverage into those with high (36 samples, genome coverage above 89.1%), medium (16 samples, coverage 56.6-86.9%) and low quality (9 samples, coverage 9.8%-45.8%) somy calls (S1A Dataset and S7 Fig and S8 Fig). Two samples (coverage 1.1% and 0.8%), were excluded from this analysis because their depth was too low to quantify somy. The high-quality samples provided a clear picture of the average ploidy of amastigote populations in human hosts: most chromosomes showed an S-value around 2 (thus overall disomic), with only chromosome 31 showing consistently high somy around 4 ( Fig 2B). Other exceptions were chromosomes 16 and 18 which showed S-value of 1 (thus overall monosomic) in BPK89BM, and in BPK146BM, respectively (Fig 2B and S7 Fig.). Chromosome 17 showed a S-value of 1 or lower than 2 (suggestive of mosaicism between monosomic and disomic variants) in seven samples: BPK276BM, BPK104BM, BPK146BM, BPK161BM, BPK198BM, BPK191BM, BPK4BM (Fig 2B and S7 Fig. and S8 Fig). We detected two main local copy number variants (CNVs) specific of CG (termed M-and H-locus, [8]) in all 51 SuSL-seq CG samples. When we compared average normalized depth for these two CNVs between clinical and previously published promastigote samples for each genetic group, we found good agreement for M-locus with correlation R 2 = 0.941 and p-value = 2.96 x 10 −4 , and slightly lower depth for H-locus in clinical samples with correlation R 2 = 0.571, p-value = 4.84 x 10 −2 (Fig 2C). The ISC1 group contained many other CNVs that were used to test the accuracy of local depth measured in SureSelect platform. We found correlation (R 2 = 0.9921, p-value = 5.3449e-14) between the depth in the ISC1-specific CNVs in the previously analyzed promastigote samples and the depth in the BPK72BM clinical sample (S9 Fig).

Comparison of paired clinical and cultured isolate samples
For 12 of the SuSL-seq clinical samples, paired cultured parasite isolates derived from the same clinical material that was used for direct sequencing were available. We compared the genotypes within each of these 12 pairs, at three levels: SNPs, indels and chromosome copy number. SNP analysis revealed that nine SuSL-seq samples were assigned to different ISC genotype groups to the matched cultured isolates (Table 1), with samples assigned to different core group genotypes differing by 9-26 homozygous sites and 3-13 heterozygous positions. In two cases, the clinical samples were genotyped as being in the CG and the isolate was assigned to ISC1; these pairs of samples were far more distinct, differing in 29942 and 34146 homozygous SNPs and 4370 and 167 heterozygous sites respectively. For the 3 paired samples with matching ISC genotypes, we found only minor nucleotide differences (0-9 homozygous and 3-7 heterozygous SNPs). We found differences in Indel variants in 10 out of 12 samples: in one of them (BPK081) a single base insertion was observed in LdAQP1 (described in detail above) in the bone marrow, which was not present in the paired culture (Table 1). Finally, chromosome copy number differed in 10 paired samples, while in BPK087 and BPK081 we found no significant differences in somy values, and parasites remained overall disomic in culture ( Fig 3A). Accordingly, none of the 12 paired samples revealed identical genomic features between the clinical samples and the derived isolates.
We then focused on the nine paired samples for which different ISC types were detected in the bone marrow and the derived isolates. Considering the sympatric ISC genotype diversity highlighted above, it is possible that some patients harbor polyclonal infections, with a given genotype dominant in a human host and another one dominant in culture, because of fitness differences in the respective environments. This possibility was explored by further genetic analyses and by competition experiments.
First, we found a second and minor genotype in the SuSL-seq mapped reads of two clinical samples (BPK471BM and BPK296BM), at a proportion estimated as 9.2%, and 6%, respectively ( Fig 3B). We also found the evidence of polyclonality in one of the cultured isolates, which differed in its ISC genotype from the paired clinical sample (Table 1): in BPK157MO, ISC1 constituted the main genotype (93.85% of the reads), and we found 6.15% that were not ISC1 at all ISC1 specific SNP sites. Secondly, we tested for the presence of more than one genotype in unenriched clinical samples by applying ISC single locus genotyping (ISC-SLG) combined with NGS-based deep amplicon sequencing using four previously described diagnostic SNPs/insertion allowing identification of the ISC3, ISC4, ISC5 and ISC6 genotypes [23]. Short regions flanking the SNP/indel were amplified, sequenced and the frequency of both alleles in all the tested samples was estimated. In order to increase the chance of detecting the presence of minor genotypes, we selected clinical samples with the highest parasite loads and/or highest total DNA contents. In addition, we included polyclonal clinical samples BPK471BM and BPK296BM (see above), two clinical sample-isolate pairs (BPK077 and BPK276) and cloned lines, which served as positive or negative controls of the presence of polymorphism at the diagnostic SNP/indel sites. The estimated percentage of Leishmania DNA and number of Leishmania genome equivalents in the clinical samples varied between 0.05-2.85%, and 12-60901, respectively (S1A Dataset). We confirmed the principal ISC genotypes defined by SuSL-seq in all clinical samples (S1A Dataset). We observed small traces of alternative alleles representing other genotypes in most of the clinical samples. However, these allele frequencies were observed only at the background base noise level, measured in the control cloned lines (S10 Fig and S1A Dataset). For the AQP1 insertion, the situation was similar with the exception of the sample BPK120BM (defined as belonging to ISC9), which indicated the presence of a small population of ISC5 genotype (z-score:3.2; p-value: 0.0014). The reciprocal was found in BPK276BM, an ISC5 sample in which a small proportion of WT alleles was encountered and we found no alleles containing the ISC5-specific AQP1 insertion in the paired clinical isolate. Although these results are not significant (p-value: 0.139), they are consistent with the observed ISC5 genotype in BPK276BM, and ISC6 genotype in the derived isolate (S1A Dataset). Finally, to test whether different growth rates could explain the different genotypes observed between direct clinical samples and those grown in vitro, we set up a growth-competition experiment. We inoculated flasks with cloned strains BPK091 and BPK077 (ISC4 and ISC6, respectively), and BPK067 and BPK091 (ISC3, and ISC4, respectively) together in the following ratios (strain1/strain2): 90/10, 50/50, and 10/90. Using Sanger sequencing to monitor the presence of diagnostic SNPs that could distinguish the two genetic groups at different time points during in vitro cultivation, we were able to detect which of the two co-cultured strains was dominant. We reproducibly observed for both parasite combinations that one strain consistently dominated the culture, even if it was only present as 10% of the initial inoculum (S11 Fig). Interestingly, BPK091, which was present in both mixes, became dominant after 21 passages when co-cultured with BPK077 but after only five passages it was out-competed by BPK067.

Discussion
We demonstrated here proof-of-principle for the sequencing of Leishmania genomes directly from clinical samples of patients with VL. Tested on 63 clinical samples from Nepal, SuSL-seq showed a high analytical sensitivity: we were able to (i) perform phylogenomic analyses on 82.5% of samples, (ii) assign 97% of samples to previously defined ISC genotypes using a set of diagnostic SNPs, (iii) estimate chromosome copy number in 82.5% of samples and (iv) identify large local CNVs in 83%. With the current design of the baits, our method should be applicable to all the parasites causing visceral leishmaniasis i.e. L. infantum and L. donovani, in East Africa, the Mediterranean basin and Latin America. Further work is required to test SuSL-seq on other host samples (animal reservoir, insect vectors, or different host tissues, including blood). It would be desirable to improve its performance in Leishmania samples with low parasite load and/or low amount of total DNA: here, the combination of SureSelect with other enrichment methods might prove useful.
Two alternative methods of enrichment deserve attention for future work. The first one consists in removal of methylated host DNA [13] and takes advantage of the differential levels of methylation between the mammalian host and pathogenic DNA. In particular it has been applied to enrich samples in bacterial or Plasmodium spp DNA through differential restriction digestion and commercial kits are available to facilitate this approach [13,24]. Although this approach of enrichment is promising, it requires large amounts of total DNA (1-2μg) and high content of pathogen DNA (10% for Plasmodium containing samples), precluding its application on many clinical samples. Moreover, at the beginning of our project, the methylation status of Leishmania DNA was unknown, therefore we did not consider this enrichment method: noteworthy, we recently preprinted a study showing the low methylation status of Leishmania [25], paving the way for exploring enrichment methods based on removal of methylated host DNA. The second alternative enrichment method, selective whole genome amplification (SWGA) [14], is based on amplification with phi29 using short oligonucleotides that preferentially bind to several sequences across the whole genome of a parasite [14,26,27]. As a result, pathogen's genome is preferentially amplified allowing the enrichment. The advantage of this method is that it can be applied to relatively low amounts of total DNA (as low as 5ng) and to samples with much lower levels of parasite's DNA content (in the range of 0.03% -0.1%, depending on the species and the sample used). However, as the enrichment is based on whole genome amplification, it can be a challenge to obtain even genome coverage. As it uses multiple displacement amplification (MDA) it is not directly applicable for aneuploidy measurements [28].
Our study sheds new light on the biology of Leishmania in the human host. A first aspect concerned the karyotype of amastigotes. Massive aneuploidy has been described in cultivated Leishmania promastigotes of all species [29]. In particular, sequencing of 204 L. donovani genomes [8] revealed strong differences in karyotype between parasite strains, with a baseline overall disomy and up to 22 polysomic chromosomes (essentially 3N and 4N) in single strains. In sharp contrast, we did not find high levels of aneuploidy in the amastigotes present in clinical samples of the same region. Besides chromosome 31 which is constitutively tetrasomic in all Leishmania species, most chromosomes in amastigotes of clinical samples were inferred to be disomic, and only a few monosomic or with intermediate coverage depth, the latter likely reflecting mosaic aneuploidy in the analyzed parasite populations, with mixtures of cells of different karyotypes [30]. Overall, the differences in aneuploidy observed between clinical samples and derived isolates are consistent with our previous findings where we reported close to diploid karyotype in parasites adapted to Syrian golden hamster, and highly dynamic aneuploidy during propagation of L. donovani in vitro [11]. In that study, genomic plasticity was shown to be mirrored at transcriptome level, with aneuploidy strongly correlated with the abundance of corresponding transcripts [11]. It seems that Leishmania compensate the lack of transcription regulation at initiation by a gene dosage strategy and might use this strategy for a rapid adaptation in response to environmental cues [11].
A second potential finding is related to the observed genotypic differences between 12 clinical samples and directly derived cultured isolates. Strictly speaking and considering different types of genetic variants (SNPs, indels an somy), none of the 12 isolates showed an identical genotype to the parasites present in the paired bone marrow samples. We cannot exclude that the 1-nt indel in the LdAQP1 of BPK81BM 'mutated' during culture maintenance and restored the reading frame in the corresponding isolate's genome. However, knowing that frame shifts in this gene have been linked to Leishmania antimony resistance [8], this result shows the risk of losing drug resistance markers when genotyping isolates. The differences in karyotype as observed in 11/12 paired samples could also be acquired during culture, but could also have resulted from selection from a mosaic background, as shown experimentally [30]. More generally, however, the SNP differences that led to isolates and SuSL-seq samples representing different ISC genotypes did not appear to be explained by mutations appearing in vitro; each ISC group is characterized by a set of diagnostic SNPs and the observed switch of ISC genotypes concerned all diagnostic SNPs in these sample pairs. In two cases, the isolate and clinical samples differed at tens of thousands of sites, and clearly represented the two very different populations of parasites circulating in VL patients in Nepal. We excluded SNP-calling artefacts, and the clinical work followed specific procedures for labeling, processing and tracking all samples. Thus, excluding experimental mistakes as the reason for these results, the most likely explanation is polyclonality of the infection in the host. In such a scenario, different clones would be most abundant in the host and the promastigote culture medium, due to their differences in respective fitness in these two environments.
We have attempted to verify the polyclonality hypothesis through a set of direct and indirect experiments. On one hand, in two clinical samples and one cultured isolate we detected minor reads (above 6%) with the signature of a second ISC genotype. We also tried to detect the second genotype directly in selected bone marrow samples with amplicon sequencing of 4 loci, which contain SNPs/AQP1 indel specific for different genetic groups from ISC. However, we were unable to find any statistically robust evidence for presence of a second genotype with this method, except in one case, BPK120BM. This is most likely due to a sensitivity issue of the amplicon sequencing method which was applied: (i) without enrichment (in most cases, in presence of more than 99% of human DNA, S1D Dataset), (ii) to find minor variants and (iii) targeting one single variant within (iv) a single copy gene. The two clinical samples BPK296BM and BPK471BM illustrate this problem well: they were clearly shown to be polyclonal from SuSL-seq data, while the second clone could not be detected from amplicon sequencing, most likely due to the low number of Leishmania genome equivalents (respectively, 43 and 12). On the other hand, the plausibility of this hypothesis was indirectly validated by a competition experiment in which we co-cultivated strains of different ISC genotypes in the same tube: even if a strain constituted only 10% of the initial inoculum, it could rapidly and in a reproducible way dominate over time in culture.
Further work is required to assess the extent of polyclonality among L. donovani infections. The phenomenon is common in Plasmodium, especially in high transmission areas [31]. Among trypanosomatids, it has been reported in Trypanosoma cruzi, where among others multiple genotypes were found to be transmitted congenitally [32] and temporal variation of genotypes was observed in asymptomatic carriers [33]. In Leishmania, mixed infections of different species were reported in wild rodents [34], horses [35] and humans [36,37]. A recent study documented the possibility of polyclonality in L. donovani from East Africa, as different genotypes were isolated from different organs of the same patient [38]. A similar observation was made on a Spanish dog in which two different zymodemes of L. infantum were collected from the skin and a lymph node [39]. Furthermore, natural Leishmania hybrids are regularly reported all over the world (see for instance [40,41]) and in our previous study in the ISC [8], we found 8 isolates of hybrid origin (out of 192 isolates of the CG, i.e. 4.2%). This could only occur in sand flies as a result of polyclonal infections.
The present work highlights the complexity of L. donovani infections in the human host. It appears that a patient can be infected by multiple clones showing different genotypes, which possibly vary at the somy level. This genetic diversity could offer an adaptive advantage to the whole population of cells, by providing 'individual' solutions to different environments, leading to different dominant genomes in the patient and in culture. We showed that this did not have a major impact so far for phylogenomic studies. However, if any link is to be made between parasite genome variation and clinical phenotypes-for example treatment outcome or pathogenicity-analyzing Leishmania directly from clinical samples is necessary, as both the genotype and aneuploidy in the mammalian host and in vitro culture can differ.

Data Release
Raw data was deposited in the European Nucleotide Archive (ENA) with the accession number ERP110990 Blue line represents the frequency for the non-ISC5 allele, while the orange line represents the ISC5-specific allele, which contains a 2bp-insertion in the AQP1 locus. Arrows indicate the two clinical samples in which a second allele was detected above the background, being statistically significant for BPK120_ISC5_BM. BM, Bone Marrow; SP, Spleen aspirate; MO, cultured isolate; NC, Negative Control, a cloned strain where the diagnostic ISC5-specific allele is absent; PC, Positive Control, a cloned strain characterized by the presence of the diagnostic ISC5-specific allele. (TIF) S11 Fig. Growth competition experiment. Flasks were inoculated with individual cloned strains or mixtures of two cloned strains in the following ratios: 90/10, 50/50 and 10/90. Cultures were analyzed at the beginning of the experiment (Start), after 5, 15, 20 and 24 passages (P5, P15, P20 and P24 respectively). Diagnostic PCR that allows to distinguish between the two compared strains was used to monitor the presence of each strain in culture. A selected fragment of chromatogram containing the variable nucleotide (labelled with red circle) is shown. Arrows point at the samples were the changes in dominance occurred. A. Competition between the BPK091, and BPK077 strains (ISC4, and ISC6, respectively). B. Competition between the BPK091, and BPK067 strains (ISC4, and ISC3, respectively). (TIF) S1 Dataset. A. Leishmania genome equivalents in samples used for NGS-based amplicon sequencing. The table shows the percentage of Leishmania DNA in the sample, the total amount of DNA, the amount of Leishmania DNA and the corresponding Leishmania genome equivalents used in each reaction to amplify short DNA fragments containing the diagnostic SNPs/AQP1 insertion for specific ISC genotypes. SuSL-seq genotype denotes the genotype assessed based on the phylogenomic clustering, while amplicon-seq genotype is based on the 4 diagnostic SNPs described by Rai et al. (ref 10). $, there are no ISC group-specific SNPs for ISC9 and ISC11, hence we could only report the negative results for ISC3-6 diagnostic SNPs. B. Excluded regions for somy estimation. For all chromosomes, the first 7000 bases and the last 2000 bases were excluded for depth evaluation. In addition, positions (x) in the following ranges were excluded because of the depth errors associated with SureSelect enrichment and because of the known major repetitive or amplified regions (see remarks). C. Genotyping using group specific SNP motifs: The table shows the ISC group, chromosome, position, reference allele, SNP allele, SNP score based on GATK, reference base motif and SNP base motif. All the diagnostic SNPs were homozygous SNPs except those of ISC7. Here ISC7 genotype was not detected: in a previous study (1), we found that genotype exclusively in India. A diagnostic SNP and reference base are located at the 12th base. D. ISC5 AQP1 insertion motifs: The reference base motif and AQP1 insertion base motif, along with neighboring motives that includes 2 base insertions of any base combinations at a given position are shown. The frequency counts of these motives were given for each sample and these neighboring motives were used to estimate the error rates for detecting any two-base insertion nearby. These data indicated that the insertion detection error rates were extremely low. ISC5 strains showed higher insertion error rate but these were likely associated with base errors in the detection of the AQP1 insertion. E. The sequence depth and coverage statistics: The average values of median depth of 36 chromosomes, the depth standard deviation, the genome coverage of Sure-Select enriched and raw samples were given for each sample. Low sequence depth samples were marked as LOWDEP, indicating that these samples were not included in the main phylogenetic tree because of their missing depth and coverage. Somy detection accuracy classification was assigned to each sample. High corresponds to 36 high genome coverage samples, Medium corresponds to 16 medium range genome coverage samples and Low corresponds to 9 low genome coverage samples. NA corresponds to 2 samples with almost no informative genome coverage. Further descriptions were found in the text. F. Sympatric genotypic diversity at VDC (village development committee, the lowest administrative unit in Nepal) level (available on request to authors). VDCs where 2 different ISC genotypes (or more) were identified. � , because of lower sequence quality, ISC genotype was defined by a set of diagnostic SNPs. 1 Samples 80, 207 and 208 originate from subjects belonging to the same family: they presented 2 different genotypes sampled at 10 months of interval. 2 Samples 89 and 91 originate from subjects belonging to the same family: they presented 2 different genotypes sampled at the same time. G. New insertions identified in SureSelect samples: These indels were in non-repetitive regions and supported with depth greater than 60. They were only identified in SureSelect samples. This illustrated that previously unknown indels can be detected in SureSelect samples when there was sufficient depth. These indel calls were verified on the IGV alignment viewer to ensure the accuracy of indel alternative allele frequencies.