Genomic variation in Plasmodium vivax malaria reveals regions under selective pressure

Background Although Plasmodium vivax contributes to almost half of all malaria cases outside Africa, it has been relatively neglected compared to the more deadly P. falciparum. It is known that P. vivax populations possess high genetic diversity, differing geographically potentially due to different vector species, host genetics and environmental factors. Results We analysed the high-quality genomic data for 46 P. vivax isolates spanning 10 countries across 4 continents. Using population genetic methods we identified hotspots of selection pressure, including the previously reported MRP1 and DHPS genes, both putative drug resistance loci. Extra copies and deletions in the promoter region of another drug resistance candidate, MDR1 gene, and duplications in the Duffy binding protein gene (PvDBP) potentially involved in erythrocyte invasion, were also identified. For surveillance applications, continental-informative markers were found in putative drug resistance loci, and we show that organellar polymorphisms could classify P. vivax populations across continents and differentiate between Plasmodia spp. Conclusions This study has shown that genomic diversity that lies within and between P. vivax populations can be used to elucidate potential drug resistance and invasion mechanisms, as well as facilitate the molecular barcoding of the parasite for surveillance applications.


Results
We analysed the high-quality genomic data for 46 P. vivax isolates spanning 10 countries across 4 continents. Using population genetic methods we identified hotspots of selection pressure, including the previously reported MRP1 and DHPS genes, both putative drug resistance loci. Extra copies and deletions in the promoter region of another drug resistance candidate, MDR1 gene, and duplications in the Duffy binding protein gene (PvDBP) potentially involved in erythrocyte invasion, were also identified. For surveillance applications, continental-informative markers were found in putative drug resistance loci, and we show that organellar polymorphisms could classify P. vivax populations across continents and differentiate between Plasmodia spp.

Conclusions
This study has shown that genomic diversity that lies within and between P. vivax populations can be used to elucidate potential drug resistance and invasion mechanisms, as well as facilitate the molecular barcoding of the parasite for surveillance applications.

Background
The Plasmodium vivax malaria parasite is the second most virulent malaria species after P. falciparum. Geographically, it is found throughout Asia, South and Central America, Oceania, Papua New Guinea (PNG, n = 6) [13]). All sequencing data for non-reference strains were generated using Illumina paired end technologies (read lengths !75bp). The raw sequence data were mapped to the Sal-1 reference genome (version 10.0) using bwa-mem with default parameters. SNPs (n = 447,232) were identified using the samtools software suite (samtools. sourceforge.net) with high quality scores (phred score >30, 1 error per 1 kbp). Genotypes were called using ratios of coverage, where the minimal heterozygous calls still present after filtering were converted to the majority genotype if the coverage ratio was 80:20 or greater [18,19]. SNPs were retained if they were biallelic, had low genotype missingness (<10%) and heterozygous (<0.4%) calls. SNPs in regions of extreme coverage and very low coverage were excluded, as well as in non-unique regions (using a k-mer approach with length of 54 bp) and highly polymorphic VIR genes. Two samples were found to have P. vivax and P. falciparum co-infections (ERR020124 and SRR828528), and were excluded from population genetic analysis. Isolates were retained if they had at least average 10-fold genome-wide coverage, and at most 10% missing genotype calls. The final high quality dataset consisted of 46 (62.2%) isolates (Thailand 22, Southeast Asia 24, South America 11; other 11; S1 Table) and 219,288 SNPs, and used as the basis of population genetic analyses. FreeC software (http://bioinfo-out.curie.fr/projects/ freec/tutorial.html) was used to identify regions of the genome with a significant increase or decrease in read coverage identifying potential copy number variants (CNVs) after accounting for GC bias through coverage normalization. Regions identified as CNVs were inspected visually and assessed using de novo assembly methods [20].

Population genetics
Genetic diversity was estimated using the average pairwise nucleotide diversity (π) with the R package "pegas". An in-house R script was used to compute the allele frequency-based Tajima's D test [21] to identify genes under balancing selection in the individual populations (values > 2.5; [18]), this method was chosen over the dN/dS approach given the latter being not fit for analysis on individual populations [22]. To detect signals of directional selection, the integrated Haplotype Score (iHS) approach [23] was applied to individual populations supported by a principal component analysis (PCA). This approach used the most frequent allele where mixed calls where found so the haplotype analysis will be based on the most abundant strain in each sample [7]. P-values for iHS were computed from standardised values based on a 2-tailed conversion from a Gaussian distribution [19]. The Salvador-I being the reference and oldest sample was used as ancestral haplotype. Multiplicity of infection was estimated using a novel method of counting the unique haplotypes formed by polymorphism on paired sequencing reads (estMOI, [24]). For comparisons between populations, we first applied PCA and neighbourhood joining tree clustering based on a matrix of pairwise identity by state values. These analyses were followed by applying the cross population long-range haplotype method (XP-EHH [25], Rsb implementation [19]) and the population differentiation metric F ST [26]. P-values for the Rsb estimates were calculated using a Gaussian approximation [19]. A significance threshold of P < 0.001 was established for both iHS and Rsb using bootstrapand permutation-based simulation approaches [18,19]. We used the ranked F ST statistics to identify the informative polymorphism for the barcoding of populations and driving the clustering observed in the PCA. Linkage disequilibrium (LD) was assessed in the two populations with the largest sample sizes (Thailand and South America) using the r 2 and D' metrics [27], calculated for pairs of SNPs with different physical separation up to 2 kbp using a sliding window approach. The SNPs were annotated and effects of variants on genes (such as amino acid changes) were predicted using snpEFF software [28]. The R statistical package was used to analyse SNP data, including implementation of selection analyses using the "rehh" library.

Genetic polymorphisms
The genomic coverage in the nuclear genome was high (median 103-fold, range (30-5973fold), and in keeping with multiple organellar copies, the mitochondria and apicoplast coverage was 30-fold and 1.8 fold greater than the nuclear coverage. The density of SNPs in the nuclear genome (219,288 SNPs, 1 every 99 bp) was greater than in the mitochondrial (23 SNPs, 1 every 165 bp) and apicoplast genomes (176 SNPs, 1 every 165 bp) (S2 Table). Although 60% of the annotated reference genome is coding (chromosomal range: 54%-64%), approximately half the SNPs in the isolates were found in genic regions (mean 48% per chromosome, range 43% to 52%) (Fig 1A). The proportion of non-synonymous sites is consistent with those found in other Plasmodium species, with 52% of coding SNP sites being non-synonymous in the nuclear genome, 36% in the mitochondrion and 56% in the apicoplast. The differences in these genomes suggest they may be subject to differential selective pressure [29]. The majority of SNPs are rare, with nearly half of the mutations (45%) being observed in single samples ( Fig  1B) as seen in other Plasmodium populations [18]. There was some evidence of polyclonality in 22 samples (Cambodia 1/2, Colombia 5/5, Madagascar 2/2, PNG 2/5, Thailand 11/22).
Analysis of structural variants and copy number variants was limited to Thai, Cambodian and Madagascan isolates, which had high and uniform genomic coverage. CNVs were located less than 1 kbp distance from the MDR1 gene (chromosome 10, PVX_080100) in Cambodian and Thai isolates (S1 Fig). Several MDR1 variants have previously been reported, some considered putative chloroquine-and mefloquine-resistance alleles [14, [30][31][32][33][34][35]. At the MDR1 locus, we observed either a duplication of~35kb (position 351kbp to 389kbp, n = 1, Thailand), a major deletion in the promoter region of the gene (n = 7, Thailand; n = 1 Cambodia), or a combination of both structural variants, including two copies, one with the deletion in the promoter and another copy with a complete promoter (n = 4, Thailand); as confirmed by the increase or decrease in coverage and accumulation of split reads in the regions where a break in the coverage occurs. The known duplication in the Duffy binding protein PvDBP (chr. 6: 974,000-982,000, PVX_110810) in Malagasy [36] was confirmed in one of the two Madagascan isolates (SRR828416). The PvDBP gene is potentially involved in erythrocyte invasion, and the duplication was also observed in thirteen Thai isolates. A further duplication was observed in Pv-fam-e (a RAD gene, chr. 5: 895,000-900,000, n = 8, Thailand), a gene linked to P. vivax selectivity for young erythrocytes and/or immune evasion [36].

Assessing genetic diversity, LD and positive directional selection
The average polymorphism (pair-wise mismatches measured by nucleotide diversity π) was calculated by gene and chromosome. There was little difference across the chromosomes with mean 11.1x10 -4 (range 6.0 x 10 −4 to 19.0x10 -4 ), which is consistent with other studies with similar sample size [14] as well as larger datasets when restricted to high quality SNPs (1.5x10 -3 ) [6, 7]. LD decays rapidly for non-rare polymorphism (minor allele frequency ! 5%) within a few hundred base pairs, and reaches a baseline within 500bp in South American and Thai nuclear genomes (Fig 1C). Like P. falciparum, there is a high correlation between non-rare SNPs (median D' 0.918, range 0.425-1) in the mitochondrial and apicoplast genomes supporting the inference that the organelles are co-inherited and supporting the view that these SNPs have potential utility for barcoding [29].
To examine evidence for signatures of positive natural selection we calculated the iHS metric in the Thailand and South America populations, informed by the population structure reported in S3 Fig. Five contiguous loci of strong positive directional selection were identified, including the MRP1 gene (PVX_097025) and its promoter region in Thailand, and a region surrounding the MRP2 gene (PVX_097025, P. falciparum homologue associated with primaquine and antifolate drug sensitivity [14]). Several surface proteins were identified in both populations, including the MSP7 and MSP3.1, which are thought to be under selection pressure due to their role in erythrocyte invasion and strong vaccine candidates and have been identified before by other studies using sanger sequencing [29] (Table 1, Table 2

Allele frequency spectrum and balancing selection
The allele frequency spectrum of different classes of nucleotide sites all show an excess of rare alleles, with coding, non-synonymous, synonymous and intergenic sites more skewed than expected under a Wright-Fisher model of constant population size [18]. This observation could indicate a population expansion in the recent past, where as a population grows in size, the frequency of rare alleles also increases [18]. The Tajima's D method was applied to genes with at least five SNPs in the two main populations (Thailand 4,673 (91.0%) and South America 3,549 (70.0%) genes). The majority of Tajima's D values were negative (Thailand 90.2%; South America 64.4%), reinforcing the presence of an excess of low frequency and singleton polymorphisms, potentially due to population expansion in the recent past or purifying selection. For Thailand, we identified 398 (8.5%) genes with positive Tajima's D values, of which 14 were in excess of 2.5 and potentially under balancing selection (Table 3). Similarly, for South America, of the 1,260 (35.5%) values that were positive, 12 were in excess of 2.5 ( Table 3). The loci under potential balancing selection in both populations encode proteins with predominantly roles surface proteins (e.g. MSPs) and antigens. The majority of the 26 genes identified

Population structure and evidence of differing directional selection in populations
Both a principal component and a neighbourhood joining tree analysis (Fig 2, S4 Fig) revealed clustering by continent, in keeping with similar P. falciparum analyses [18,19]. The across population long-range haplotype method (Rsb implementation) was applied to compare Thailand to the South American population, to identify regions potentially under recent directional selection ( Table 4). We detected several loci including at multidrug resistance-associated protein MRP1 (PVX_097025), and the CCR4-associated factor 1 (CAF1, PVX_123230) located  Table). No evidence was observed of a hard sweep around the MDR1 copy number gene. However, nine non-synonymous SNP mutations were identified, five of which have been reported previously. These included the fixed alleles T958M and M908L, F1076L at high frequency across populations, and G698S and S513R absent from South America

Towards molecular barcoding of P. vivax
The development of molecular barcode for P. vivax could ultimately assist with surveillance and disease control. Previous work [51] has described a 42 SNP barcode to classify geographically P. vivax across 7 countries. Across the 46 isolates analysed here, we found 3 SNPs in the barcode to be either non-segregating or not passing quality control filtering. Use of the remaining 39 SNPs led to imperfect clustering by continent (S4 Table, S5 Fig). Application of the F ST population differentiation metric identified SNPs driving the observed differences between Thailand, South America and other populations (S5 Table). These SNPs occurred in drug resistance loci, including MRP1 (PVX_097025), DHPS (PVX_123230) and UBP1 (PVX_081540) (all F ST > 0.72), and in close proximity (e.g. PVX_089960 within 8kb of DHFR).
Population differentiation due to genetic diversity in drug resistant loci is also observed in P. falciparum [18,19]. Previous work has proposed the mitochondria and apicoplast organellar genomes as candidate regions for a barcode [29]. Genotyping of organellar markers would benefit from greater copy number and coverage as well as highly conserved sequences [29]. Eight markers across five apicoplast genes could differentiate Thai and Southeast Asian samples from the other isolates, and two non-genic markers were found to be exclusive to South America (all F ST >0.7, S6 Table). No informative mitochondrial markers were identified (all F ST <0.7). Further, as the organelle genomes are known to be highly conserved between Plasmodia species, when comparing a set of P. falciparum geographical markers [26] to P. vivax sequences, we found evidence of positions close in the sequence. Two of the samples (ERR020124 and SRR828528) had a high density of mixed calls in the organellar genomes, in this case, a signature of P. falciparum overlaying onto P. vivax (S6 Fig). In general, this density signature is indicative of a coinfection of P. vivax with another Plasmodium spp. By comparing the sequencing reads to the Plasmodium knowlesi reference genome [52], there was no evidence of any vivax and knowlesi co-infections. However, the presence of a unique triallelic SNP reinforces the potential for an organellar inter-plasmodia species barcode (S6 Fig).

Discussion
Several studies have previously described the genomic diversity of P. vivax populations using whole genome data, but with low sample sizes. Recently, two papers using a combined collection of over 400 isolates from 17 countries described major genomic diversity in Plasmodium vivax [6,7]. Here we analysed a complementary collection of 46 high quality isolates spanning 10 countries across 4 continents in order to position them within the context of this new work. As expected we confirmed that P. vivax genomic diversity is greater compared to P. falciparum, and even at a relatively low sample size, the samples clustered geographically. We reveal a wider genomic distance between South American and Southeast Asian continents than observed between P. , however more research is needed in this area. Microsatellite genotyping has been used previously to cluster geographically P. vivax isolates, and together with antigen genotyping identify mixed infections and extent of transmission, used as the basis of genetic epidemiology. In comparison, whole genome sequencing provides a higher specificity in the application of geographical clustering [51]. While other studies have focused on creating a barcode using the nuclear genome [51], we also considered organelle genomes (mitochondrion and apicoplast), which are more stable over time, do not undergo recombination and are co-inherited [29]. The analysis revealed organellar markers that are potentially Southeast Asian and South American specific, and others that highlighted the presence of multi-species mixed infections. The sequencing of large numbers of isolates, beyond currently published samples sizes, will be required to establish robust intra-and interspecies organellar-based barcode. Such large-scale datasets across multiple regions will also serve to identify the high genomic diversity that lies within and between P. vivax populations, which could be exploited for biological insights, including elucidating drug resistance and invasion mechanisms, and ultimately measures of disease control.

Conclusion
This study has shown that genomic diversity that lies within and between P. vivax populations can be used to elucidate potential drug resistance and invasion mechanisms, as well as facilitate the molecular barcoding of the parasite for surveillance applications. South America. Ã iHS integrated haplotype score; see Table 1 and Table 2 Table 4 for a summary of the hits.