Whole genome sequencing of Plasmodium vivax isolates reveals frequent sequence and structural polymorphisms in erythrocyte binding genes

Plasmodium vivax malaria is much less common in Africa than the rest of the world because the parasite relies primarily on the Duffy antigen/chemokine receptor (DARC) to invade human erythrocytes, and the majority of Africans are Duffy negative. Recently, there has been a dramatic increase in the reporting of P. vivax cases in Africa, with a high number of them being in Duffy negative individuals, potentially indicating P. vivax has evolved an alternative invasion mechanism that can overcome Duffy negativity. Here, we analyzed single nucleotide polymorphism (SNP) and copy number variation (CNV) in Whole Genome Sequence (WGS) data from 44 P. vivax samples isolated from symptomatic malaria patients in southwestern Ethiopia, where both Duffy positive and Duffy negative individuals are found. A total of 123,711 SNPs were detected, of which 22.7% were nonsynonymous and 77.3% were synonymous mutations. The largest number of SNPs were detected on chromosomes 9 (24,007 SNPs; 19.4% of total) and 10 (16,852 SNPs, 13.6% of total). There were particularly high levels of polymorphism in erythrocyte binding gene candidates including merozoite surface protein 1 (MSP1) and merozoite surface protein 3 (MSP3.5, MSP3.85 and MSP3.9). Two genes, MAEBL and MSP3.8 related to immunogenicity and erythrocyte binding function were detected with significant signals of positive selection. Variation in gene copy number was also concentrated in genes involved in host-parasite interactions, including the expansion of the Duffy binding protein gene (PvDBP) on chromosome 6 and MSP3.11 on chromosome 10. Based on the phylogeny constructed from the whole genome sequences, the expansion of these genes was an independent process among the P. vivax lineages in Ethiopia. We further inferred transmission patterns of P. vivax infections among study sites and showed various levels of gene flow at a small geographical scale. The genomic features of P. vivax provided baseline data for future comparison with those in Duffy-negative individuals and allowed us to develop a panel of informative Single Nucleotide Polymorphic markers diagnostic at a micro-geographical scale.

Introduction Vivax malaria is the most geographically widespread human malaria, causing over 130 million clinical cases per year worldwide [1]. Plasmodium vivax can produce dormant liver-stage hypnozoites within infected hosts, giving rise to relapse infections from months to years. This unique feature of P. vivax contributes to an increase in transmission potential and increases the challenge of elimination [2]. Understanding P. vivax genome variation will advance our knowledge of parasite biology and host-parasite interactions, as well as identify potential drug resistance mechanisms [3,4]. Such data will also help identify molecular targets for vaccine development [5][6][7] and provide new means to track the transmission and spread of drug resistant parasites [8][9]. Compared to P. falciparum, P. vivax isolates from Southeast Asia (e.g., Thailand and Myanmar), Pacific Oceania (Papua New Guinea), and North and South America (Mexico, Peru, and Colombia) have significantly higher nucleotide diversity at the genome level [2]. This could be the result of frequent gene flow via human movement, intense transmission, and/or variation in host susceptibility [10][11][12][13][14]. P. vivax infections are also much more likely to contain multiple parasite strains in areas where transmission is intense and/or relapse is common [10,[15][16][17][18]. In Papua New Guinea, for example, P. vivax infections had an approximately 3.5-fold higher rate of polyclonality and nearly double the multiplicity of infection (MOI) than the P. falciparum infections [16]. Similar rates of polyclonality and MOI have also been reported in P. vivax in Cambodia [6]. It is possible intense transmission has sustained a large and stable parasite population in these regions [17,18]. In contrast, geographical differentiation and selection pressure over generations can lead to fixation of parasite genotypes in local populations. In the Asia-Pacific, P. vivax showed a high level of genetic relatedness through inbreeding among the dominant clones, in addition to strong selection imposed in a number of antimalarial drug resistance genes [19]. In Ethiopia, the chloroquine resistance transporter gene (Pvcrt) of P. vivax on chromosome 14 had been shown with significant selection in a region upstream of the promotor, highlighting the ability of P. vivax to rapidly evolve in response to control measures [20]. Apart from mutations, high copy number observed in Pvcrt and the multidrug resistant gene (Pvmdr1) has also been shown to be associated with increased antimalaria drug resistance [21,22].
Recent genomic studies have indicated that some highly polymorphic genes in the P. vivax genome are associated with red blood cell invasion and immune evasion [10,12,19,23]. They include the merozoite surface protein genes MSP1 (PVP01_0728900) and MSP7 (PVP01_1219700), the reticulocyte binding protein gene RBP2c (PVP01_0534300), and serine-repeat antigen 3 (SERA; PVP01_0417000) [23][24][25][26][27][28][29]. Polymorphisms in genes associated with immune evasion and reticulocyte invasion have important implications for the invasion efficiency and severity of P. vivax infections. Members of the erythrocyte binding gene family, including reticulocyte binding proteins (RBPs), Duffy-binding proteins (DBPs), and merozoite surface proteins (MSP3 and MSP7) have been previously shown to exhibit high sequence variation in P. vivax [20,30]. The polymorphisms in RBP1 and RBP2 genes may relate to an increased capability of erythrocyte invasion by P. vivax [31][32][33]. It has been suggested that PvRBP2b-TfR1 interaction is vital for the initial recognition and invasion of host reticulocytes [34], prior to the engagement of PvDBP1 and DARC and formation of a tight junction between parasite and erythrocyte [35]. Apart from PvRBP, Reticulocyte Binding Surface Antigen (PvRBSA) [36], an antigenic adhesin, may also play a key role in P. vivax parasites binding to target cells, possessing the capability of binding to a population of reticulocytes with a different Duffy phenotype [37,38]. Another erythrocyte binding protein gene (PvEBP), a paralog of PvDBP1, which harbors all the hallmarks of a Plasmodium red blood cell invasion protein, including conserved Duffy-binding like and C-terminal cysteine-rich domains [39], has been recently shown to be variable in copy number in the Malagasy P. vivax [39]. Functional analyses indicated that region II of this gene bound to both Duffy-positive and Duffy-negative reticulocytes, although at a lower frequency compared to PvDBP, suggestive of its role in erythrocyte invasion [40]. Both PvEBP1 and PvEBP2 genes exhibit high genetic diversity and are common antibody binding targets associated with clinical protection [41,42]. Other proteins such as tryptophan-rich antigen gene (TRAg), anchored micronemal antigen (GAMA), and Rhoptry neck protein (RON) have also been suggested to play a role in red cell invasion, especially in low-density infections [43][44][45][46][47]. Information of the polymorphisms in these genes will have important implications on the dynamics of host-parasite interactions.
Compared to Southeast Asia and South America where P. vivax is highly endemic, data on polymorphisms in erythrocyte binding gene candidates of P. vivax from Africa is limited. Filling this knowlege gap is critical for identifying functional genes in erythrocyte invasion, biomarkers for tracking the African P. vivax isolates, as well as potential gene targets for vaccine development. It was previously thought that most African populations were immune to P. vivax infections due to the absence of DARC gene expression required for erythrocyte invasion. However, several recent reports have indicated the emergence and potential spread of P. vivax across Africa [32,[48][49][50]. The objective of this study is to describe genomic variation of P. vivax from Ethiopia. Specifically, we examined the level of genetic polymorphisms in a panel of 43 potential erythrocyte binding protein genes that have been suggested to play a role in the parasite-host invasion process. In addition, we inferred transmission patterns of P. vivax infections from different study sites based on the genetic variants. A recent study by Auburn et al. [20] has compared the genetic variants of P. vivax from Ethiopia with other geographical isolates. In the present study, we focus on the genomic characteristics of P. vivax among different study sites in Ethiopia with the goals to establish a baseline for genome comparison with the Duffy-negative P. vivax in our ongoing investigation, as well as to develop a panel of Single Nucleotide Polymorphic (SNP) markers informative at a micro-geographical scale.

Ethics statement
Scientific and ethical clearance was obtained from the Institutional Scientific and Ethical Review Boards of Jimma and Addis Ababa Universities in Ethiopia, and The University of North Carolina, Charlotte, USA. Written informed consent/assent for study participation was obtained from all consenting heads of households, parents/guardians (for minors under 18 years old), and each individual who was willing to participate in the study.

Study area and sample collection
Genomic DNA was extracted from 22 clinical samples collected in Jimma, southwestern Ethiopia during peak transmission season (September-November 2016; Fig 1). Finger-pricked blood samples were collected from individuals with symptoms suspected of malaria infection (who has fever with axillary body temperature > 37.5˚C and with confirmed asexual stages of malaria parasite based on microscopy) or febrile patients visiting the health centers or hospitals at each of the study sites. Thick and thin blood smears were prepared for microscopic examination, and 4-6 ml of venous blood were collected from each P. vivax-confirmed patient in K2 EDTA blood collection tubes. For the whole blood samples, we used the Lymphoprep/Plasmodpur-based protocol to deplete the white blood cells and enrich the red blood cell pellets [51]. DNA was then extracted from approximately 1 ml of the red blood cell pellets using Zymo Bead Genomic DNA kit (Zymo Research) following the manufacturer's procedures. The extracted DNA were first assessed by nested and quantitative PCR methods to confirm and quantify P. vivax of the infected samples [52]. For these 22 samples, we performed microsatellite analyses using seven different loci [53]. Only monoclonal samples were selected and proceeded for sequencing. Whole genome sequencing was conducted on the Illumina HiSeq 3000 Sequencing Platform at the Wellcome Sanger Institute (European Nucleotide Archive [ENA] accession number of each sample in Table 1). The generated sequence reads were mapped individually to the publicly available reference genome PvP01 from Gene DB using BWA-MEMv2 [54][55][56]. Only reads that were mapped to P. vivax were included. The percentage coverage of the P. vivax reads in our samples were high enough to not affect the results.
An additional 24 sample sequence data were obtained as FASTQ files from the ENA. These samples were collected from Arbaminch, Badowacho, Halaba, and Hawassa in southwestern Ethiopia (Fig 1), the Duffy status of each of these 24 samples is unknown. All P. vivax genomes in this study were aligned to the PVP01 reference genome using BWA-MEMv.2 with default  settings [55,56]. The overall quality of each resulting BAM was assessed using FASTQC. Similarly, we concluded that the percentage of the P. vivax reads covered in the additional 24 samples were high enough to reflect the dominant signal of the variants and negate polyclonal influences. Two of our samples displayed a significant decline in average quality in read mapping and were therefore removed from further SNP variant and copy number variation analyses.
To provide a comparison of SNP and copy number variants with the Ethiopian isolates, we chose 50 additional P. vivax genomes from Southeast Asia (mainly from Cambodia and Thailand where several genomes are published and available) deposited in the ENA database [19]. These sequences were realigned using BWA-MEMv.2, with the same default settings as the Ethiopian genomes, to ensure all map files were constructed using the same tool and parameters. Furthermore, to determine the clonality of the isolates, we calculated the Fws (within-host heterozygosity) statistic using the moimix R package [57]. After performing the variant call, we removed the indels and filtered the SNPs. These high-quality SNPs were then used to calculate the Fws value using the getfws() function following the procedures suggested in the moimix R package.

SNP discovery, annotation, and filtering
Potential SNPs were identified by SAM tools v.1.6 mpileup procedure [58] in conjunction with BCF tools v.1.6 [58] across all 44 sample BAM files using the PVP01 reference genome. Compared to the Salvador-I, the PVP01 reference genome consists of 14 chromosomal sequences and provides a greater level of gene annotation power and improved assembly of the subtelomeres [56]. We analyzed only sequence reads that were mapped to these 14 major chromosomal sequences. The hypervariable and subtelomeric regions in our samples were discarded during the variant calling procedure and each sample BAM file had duplicates marked using SAMtools 1.6 markdup procedure. For the mpileup procedure, the maximum depth threshold, which determines the number of maximum reads per file at a position, was set to 3 billion to ensure that the maximum amount of reads for each position was not reached. Samples were pooled together using a multisampling variant calling approach. The SNPs were then annotated with SnpEff v.4.3T [59] based on the annotated gene information in GeneDB. Filtering was done using the following standard metrics, including Read Position Bias, Mapping Quality vs Strand Bias, Raw read depth, Mapping Quality Bias, Base Quality Bias, and Variant Distant Bias produced by SAM tools and BCF tools during the variant calling procedure. In Snp Sift, data was filtered by choosing SNPs that had a Phred Quality score � 40, a combined depth across samples (DP) � 30 based on post variant call, and a base quality bias >0.1 [60]. We then calculated the allele frequency for each SNP position for all 44 samples using the frequency procedure in VCF tools v.0.1.15 [61]. The total number of SNPs across all samples, as well as the number of nonsynonymous and synonymous mutations were recorded. Mutations were compared among the 14 chromosomes in addition to a panel of 43 erythrocyte binding genes. The same analysis was performed for the 50 Southeast Asian P. vivax genomes [19].

Copy number variation analyses
Copy number variation of gene regions was assessed with CNVnator [62]. CNVnator uses mean-shift theory, a partitioning procedure based on an image processing technique and additional refinements including multiple bandwidth partitioning and GC correction [62]. We first calculated the read depth for each bin and correct GC-bias. This was followed by meanshift based segment partition and signal merging, which employed an image processing technique. We then performed CNV calling, of which segments with a mean RD signal deviating by at least a quarter from genomic average read depth signal were selected and regions with a P-value less than 0.05 were called. A one-sided test was then performed to call additional copy number variants. SAM tools v.1.6 was utilized in our data preprocessing step to mark potential duplicates in the BAM files and followed the CNV detection pipeline [63]. We extracted the read mappings from each of BAM files for all chromosomes. Once the root file was constructed using the extracted reads, we generated histograms of the read depths using a bin size of 100. The statistical significance for the windows that showed unusual read depth was calculated and the chromosomes were partitioned into long regions that have similar read depth.
To validate the results from CNVnator, we used the GATK4 copy number detection pipeline to further examine gene copy number [64][65][66]. The read coverage counts were first obtained from pre-processed genomic intervals of a 1000-bp window length based on the PvP01 reference genome. The read fragment counts were then standardized using the Denoise Read Counts that involved two transformations. The first transformation was based on median counts, including the log 2 transformation, and the counts were normalized to center around one. In the second transformation, the denoises tool was used to standardized copy ratios using principal component analysis. Copy numbers were calculated for the additional 50 Southeast Asian genomes to compare with the 44 Ethiopian genomes using the same procedures described above.

Test for positive selection
Regions of positive selection were examined among the 44 P. vivax isolates from Ethiopia using the integrated haplotype score approach, specifically the SciKit-Allel for python, a package used for analysis of large-scale genetic variation data [67]. Before the samples were run through Scikit-Allel, genotypes for each of the samples were phased using BEAGLE [68]. Genes that were detected with signals of positive selection by SciKit-Allel, as well as a panel of 43 potential erythrocyte binding genes were further evaluated using the PAML package (Phylogenetic Analysis by Maximum Likelihood) [69]. Using the codeml procedure in PAML, DNA sequences were analyzed with the maximum likelihood approach in a phylogenetic framework. The synonymous and nonsynonymous mutation rates between protein-coding DNA sequences were then estimated in order to identify potential regions of positive selection. We created two models, the neutral model M1 and the selection model M2. The average d N /d S values were estimated across all branches in both M1 and M2 models and the average d N /d S values across all sites in the M2 model. The d N /d S values were compared between the two models using a likelihood ratio test for significant positive selection.

Comparison of nucleotide diversity among EBP gene regions
Based on the literature [23][24][25][26][27][28][29][30][31][32][33], we identified 43 gene regions that are potentially related to erythrocyte binding in P. vivax (S1 Table). These included the DBP (duffy binding protein), EBP (erythrocyte binding protein), MSP (merozoite surface protein), and RBP (reticulocyte binding protein) multigene families, the tryptophan rich antigen gene family (TRAg), GPIanchored microanemal antigen (GAMA), microneme associated antigen (MA), rhoptry associated adhesin (RA), high molecular weight rhoptry protein 3 (RHOP3), and rhoptry neck protein (RON) genes. Previous study has shown that the transcriptome profiles of the TRAg genes were differentially transcribed at the erythrocytic stages, indicating that these genes may play specific roles in blood-stage development [43]. The reticulocyte binding protein multigene family encodes genes that each have a receptor on the surface that is essential for the host-invasion stage of P. vivax [70]. The MSP multigene family, currently assumed to be a candidate for vaccine generation, also plays a role in the invasion stage of P. vivax and is also immunogenic [26]. The nucleotide diversity of 43 potential erythrocyte binding genes were compared among the 44 Ethiopian and 50 Southeast Asian P. vivax sample consensus sequences using DnaSP [71]. The Pairwise-Deletion method where gaps were ignored in each pairwise comparison was used for this calculation.

Genetic relatedness and transmission network analyses
Phylogenetic analyses were performed to infer the genetic relatedness among the 44 Ethiopian isolates. Sequence alignment was first conducted using a multiple sequence alignment program in MAFFT v. 7 [72]. The alignment was then trimmed to remove gaps using trimal (the gappyout option) that trimmed the alignments based on the gap percentage count over the whole alignment. After sequence editing, we concatenated all alignment files using FAScon-CAT-G [73], a perl program that allows for concatenation and translation (nucleotide to amino acid states) of multiple alignment files for phylogenetic analysis. We used the maximum likelihood method implemented in the Randomized Accelerated Maximum Likelihood (RAxML) v8 to construct phylogenetic trees [74]. The GTRGAMMA model was used for the best-scoring maximum likelihood tree. The GTR model incorporates the optimization of substitution rates and the GAMMA model accounts for rate heterogeneity. A total of 100 rapid bootstrap runs were conducted to evaluate the confidence of genetic relationships. In addition, we performed principal component analyses using the glPCA function in R, a subset of the adegenet package [75], to determine the genetic relatedness of the samples among the different study sites in Ethiopia. A transmission network was created using StrainHub, a tool for generating transmission networks using phylogenetic information along with isolate metadata [76]. The transmission network was generated using the locations of the samples as the nodes and calculating the source hub ratio for each location. The source hub ratio was calculated by the number of transitions originating from a node over the total number of transitions related to that node. A node with a ratio close to 1 indicates a source, a ratio close to 0.5 indicates a hub, and a ratio close to 0 indicates a sink for the P. vivax infections. To validate the analyses of Strainhub, we also performed F ST and ADMIXTURE analyses to determine the level of genetic differentiation among samples between the different districts in Ethiopia. The F ST analysis was done using our high-quality SNPs and the VCFtools function weir-fst-pop. We generated the corresponding bed, fam, and.bim files using PLINK v2 [77]. These files were then used in the ADMIXTURE analysis [78]. ADMIXTURE uses the same statistical model as the STRUC-TURE software, and calculates the estimates using a numerical optimization algorithm [78]. We performed the ADMIXTURE analysis from 1 (no genetic differentiation among all study sites) to five (all study sites were genetically differentiated) genetic clusters (K) and recorded the respective cross-validation scores. The optimal K was selected based on the smallest error rate and lowest cross-validation scores [78].

Distribution of SNPs among the chromosomes and EBP genes
A total of 123,711 SNPs was detected among the 44 P. vivax samples from Ethiopia (Fig 2), with 22.7% (28,118 out of 123,711) nonsynonymous and 77.3% (95,593 out of 123,711) synonymous mutations (Fig 3A). The highest number of high quality SNPs were observed on chromosomes 9 (24,007 SNPs; 19.4% of the total SNPs), 10 (16,852 SNPs; 13.6%), and 4 (11,354 SNPs; 9.2%); whereas the lowest number of SNPs were observed on chromosomes 2 (1,912 SNPs; 1.5%), 6 (2,977 SNPs;~0.2%), and 7 (4,779 SNPs; 3.9%; Fig 3A; S2 Table). The number of high-quality SNPs on each chromosome was not shown to be dependent on the size of the chromosome. For the isolates from Southeast Asia, 413,873 variants were detected after performing filtration, with 73,547 nonsynonymous (17.77%) and 340,326 synonymous (82.22%). Similar to the isolates from Ethiopia, chromosome 9 of the isolates from Southeast Asia had the largest number of SNPs (66,834 SNPs; 16.1% of the total variants), with 11% being nonsynonymous and 89% synonymous. This was followed by chromosomes 10 of the isolates from Southeast Asia (48,425 SNPs; 11.7%) with 24% nonsynonymous and 76% synonymous substitutions, and chromosome 4 (41,362 SNPs; 9.9%) with 22% nonsynonymous and 78% synonymous substitutions (Fig 3B). The number of detected SNPs on these chromosomes was much higher in the Southeast Asian than the Ethiopian isolates, given similar sample size.
For the isolates fom Ethiopia, the 43 erythrocyte binding genes accounted for 2,361 of the total SNPs, with 1,087 (46%) identified as nonsynonymous and 1,274 (54%) as synonymous mutations (Fig 3C; S1 Table). Among these genes, the highest number of SNPs were observed in the MSP3 multigene family (MSP3.5, MSP3.9 and MSP3.8) on chromosome 10 from the isolates from Ethiopia, with an average nucleotide diversity of 2.8%, respectively among our samples ( Fig 3B). By contrast, the lowest number of SNPs detected among all the 43 erythrocyte binding genes were the Duffy binding protein gene (DBP1) on chromosome 6 from the isolates from Ethiopia with a total of 13 SNPs, of which 12 were identified as nonsynonymous and one as synonymous mutations (Fig 3B). Likewise, another erythrocyte binding protein (EBP2), located also on chromosome 6 from the isolates from Ethiopia, was one of the least variable genes with only one nonsynonymous mutation. The TRAg gene family from the isolates from Ethiopia also showed a low level of nucleotide diversity when compared to the other EBP gene families with an average nucleotide diversity of 0.2% (Fig 3B).
For the Southeast Asian isolates, the 43 erythrocyte binding genes accounted for 6,130 total SNPs, with 2,923 (47.7%) nonsynonymous and 3,207 (52.3%) synonymous, a ratio similar to the Ethiopian isolates (Fig 3D). The nucleotide diversity was highest in the MSP gene family with a nucleotide diversity of 8.8% in MSP3.5 and MSP3.9 with 8.2%, followed by 6.9% in  8. On the other hand, the lowest nucleotide diversity was observed in the TRAg gene family (0.03% in TRAg23 and 0.03% in TRAg24), followed by RHOPH3 (0.04%) and RON4 (0.05%; Fig 3D).

Estimates of polyclonality based on Fws statistics
The moimix analyses showed that 24 of the 44 isolates from Ethiopia had Fws values of more than or equal to 0.90 with the highest values in 10 samples from Jimma, five samples from Badowacho, five samples from Arbaminch, two samples from Halaba, and the remaining two samples from Hawassa (S3 Table). Eight of the isolates from Ethiopia were shown to have low Fws values ranging from 0.4 to 0.7. They included isolates from Arbaminch (Fws value = 0.423) followed by a isolates from Jimma (Fws value = 0.512), indicative of the presence of more than one clone within the sample (S3 Table). For these eight isolates from Ethiopia, the SNPs and copy number variants were analyzed with the major clone only.

Gene regions under positive selection
Based on the integrated haplotype scores of our Ethiopian isolates, positive selection was detected in two gene regions (Fig 4). These included the membrane associated erythrocyte binding-like protein (MAEBL) on chromosome 9 as well as MSP3.8 on chromosome 10 ( Fig  4). Based on PAML, 17 out of the 43 erythrocyte binding genes showed evidence of positive selection ( Table 2; S4 Table). The majority of these genes belong to the TRAg multigene family. The TRAg genes had an average d N /d S ratio of 2.75 across all branches and an average of 5.75 across all sites for the M2 model tested for selection (Table 2). Compared to the other TRAg genes, TRAg15 had more sites detected under positive selection, with 50 of the sites showing a posterior probability greater than 50% and 43 showing a posterior probability greater than 95% (Table 2). While the TRAg4 gene had the highest d N /d S ratio across all sites among other TRAg genes, only six sites were shown under positive selection with a posterior probability greater than 50% and one with a posterior probability greater than 95%.
Apart from the TRAg4 genes, RBP1a and RBP2a belong to the RBP gene family also showed significant signals of positive selection (average d N /d S ratio across all sites: 5.11; Table 2). Among all the MSP genes, only MSP5, MSP9, and MSP10 indicated regions under positive selection. The MSP5 and MSP9 genes had an average d N /d S ratio of 3.85 across all sites and 1.11 across branches (Table 2). While MSP10 had an average d N /d S ratio of 1.16 across all branches and less than 1 across all sites, only seven sites were indicated with posterior probabilities greater than 50% and 95% (Table 2). Although MSP3.8 showed potential positive selection based on the integrated haplotype scores (Fig 4), PAML did not show significant evidence of positive selection.

Copy number variation and evolution of high-order copy variants
According to CNVnator, six gene regions showed copy number variation among our Ethiopian samples (Fig 5; S5 Table). Among them, three gene regions were detected with up to 2-3 copies and three gene regions with 4 copies or higher. Among the 43 erythrocyte binding genes, duplications were observed in DBP1 on chromosome 6 and MSP3.11 on chromosome 10. DBP1 ranged from one to as high as five copies, and MSP3.11 ranged from one to as high as three copies among our samples (Fig 5), consistent with previous findings [19,20,79]. The remaining erythrocyte binding genes were detected with a single copy across our samples. Compared to the Ethiopian isolates, a larger number of erythrocyte binding gene regions including DBP1, the MSP3 and RBP gene families were detected with high-order copies in the Southeast Asian isolates (Fig 5; S5 Table). These gene regions showed an average copy number ranged from one to as high as four copies, consistent with previous findings [19]. For instance, 17 of the Southeast Asian samples were detected with 2-3 copies and one sample with 4 or  Table 2

. A shortlist of 17 erythrocyte binding gene candidates that showed signals of positive selection based on the Likelihood Ratio Test of the M1 (neutral model) and M2 models (selection model) in PAML.
Gene ID is presented in S1 higher copies of MSP3.5, MSP3.8, and MSP3.9. Two samples were detected with four or higher copies of RBP2a, and one sample with 2-3 copies of MSP9. The higher copy number of RBP2a and MSP9 were not detected in a previous study by Pearson et al. [19] and this could be due to the use of different reference genome, i.e., Sal1 monkey strain in the previous and PVP01 Indonesian patient isolate in the present study. Though fewer erythrocyte binding genes were detected with copy number variation in the isolates from Ethiopia, the range of copy number  Table. Among them, PvDBP1 (PVP01_0623800) and PvMSP3.11 (PVP01_1030900) were associated with erythrocyte invasion. Other genes that were found to have high-order copy number were yet to be described exported plasmodium proteins. Annotation of these genes can be found in S4 Table. Among them, PvDBP1 (PVP01_0623800), members of the PvMSP3 gene family (PVP01_1031200, PVP01_1031300, PVP01_101400), PvMSP9 (PVP01_1445800), and PvRBP2a (PVP01_1402400) were associated with erythrocyte invasion. Four or higher copies of PvMSP3 and PvRBP2a genes were observed in the samples in Southeast Asia samples. https://doi.org/10.1371/journal.pntd.0008234.g005

PLOS NEGLECTED TROPICAL DISEASES
e.g., in the DBP1 gene was higher in isolates from Ethiopia than in isolates from Southeast Asia (S5 Table). A maximum likelihood tree constructed based on the whole genome sequences showed an admixture of P. vivax isolates with single and multiple PvDBP copy number (Fig 6A). The P. vivax isolates from Ethiopia were divided into six subclades. Subclade I contained P. vivax samples mostly from Arbaminch and Badowacho with both one and two PvDBP copies. Subclade II contained samples from Jimma and Hawassa with two PvDBP copies. Subclade III contained a mixture of P. vivax samples from Arbaminch, Halaba, Hawassa, and Jimma with single and high-order PvDBP copies. This clade was sister to subclade IV that contained P. vivax samples mostly from Jimma (Fig 6A). In subclade IV, no distinct clusters were detected between isolates with single and multiple PvDBP. Subclade V contained samples from Jimma and subclade VI contained samples from Arbaminch, Badowacho, Hawassa, and Halaba. Each of the subclades had samples with both one and two PvDBP copies. Similar patterns were observed in the MSP3.11 where P. vivax isolates with single and multiple copies were clustered together in separate subclades (Fig 6B), suggesting that this gene could have resulted from multiple duplication events.

Gene flow and transmission network of the Ethiopian P. vivax
The F ST statistics showed low levels of genetic differentiation between Arbaminch and Halaba (F ST value = -0.019), as well as between Arbaminch and Hawassa (F ST value = 0.017; S6 Table). By contrast, higher differentiation was observed between Jimma and Arbaminch (F ST value = 0.229) and between Jimma and Halaba (F ST value = 0.327; S6 Table). ADMIXTURE analysis showed that there were most likely two or three genetic clusters among all the Ethiopian samples based on the lowest cross validation scores of 0.75 and 0.78, respectively (Fig 7A). At K = 2, most of the samples from Badowacho, Hawassa, Arbaminch, and Halaba had predominantly the light-blue cluster, whereas samples from Jimma had the red cluster (Fig 7A). At K = 3, samples from Badowacho and Hawassa shared a mix of green and blue clusters; samples from Arbaminch had mostly the green cluster; samples from Halaba had mostly the blue cluster, similar to those from Jimma that had predominantly the blue cluster, though red and green clusters were also observed (Fig 7A). This clustering pattern is consistent with the F ST statistics and principal component analysis (S1 Fig), which showed the greatest genetic differences and lessen gene flow between Jimma and other study sites.
The transmission network indicated that Arbaminch was the major source of infections from which the infections in Jimma, Hawassa, Badowacho, and Halaba originated (Table 3; Fig 7B). In contrast, no transmission originated from Halaba, making this location the largest sink of transmissions. The greatest extent of transmission was observed between Arbaminch and Badowacho (Fig 7B). Hawassa and Jimma showed a source hub ratio of 0.5, indicating that there are equally as many egress transmissions as ingress transmissions (Table 3). Although Jimma and Badowacho/Halaba are in close geographical proximity, no apparent gene flow was observed between these sites.

Discussion
Across the genome, the total number of SNPs observed among 44 P. vivax isolates from Ethiopia were lower than those previously reported for P. vivax isolates from South American [80] and Southeast Asian countries, despite the different analytical tools used for SNP calling [19]. For instance, 303,616 high-quality SNPs were detected in 228 P. vivax isolates from Southeast Asia and Oceania in a previous study, of which Sal-I was used as the reference sequence and subtelomeric regions were discarded [19]. Auburn et al. [20] found that the average nucleotide diversity of P. vivax isolates from Ethiopia was lower than P. vivax isolates from Thailand and Indonesia, but higher than P. vivax isolates from Malaysia. Chromosomes 3, 4, and 5 have been previously shown to contain the lowest proportion of synonymous SNPs than the other parts of the genome [12]. In the present study, chromosomes 3 and 6 were found to have the lowest number of both synonymous and nonsynonymous SNPs. This follows observations made in other studies done with nucleotide diversity ranging from 0.8 SNPs per kb in North Korea to 0.59 SNPs per kb in Peru [81]. Among the 64 erythrocyte binding gene candidates, the MSP and RBP multigene families showed the highest level of genetic variation. This agrees with previous studies that reported a remarkably high diversity in RBP2 compared to RBP1 and its homolog group in P. falciparum [31]. In the Greater Mekong Subregion, the MSP3 and PIR gene families also indicated high levels of genetic diversity with 1.96% and 1.92% SNPs per base respectively, confirming that members of multigene families are highly variable genetically [30,82]. Such diversity suggested that the binding domains of these genes could be under differential selection pressure. This pattern has been observed in previous studies and is likely due to their critical role in reticulocyte invasion, immunogenic properties, and human migration [26, [83][84][85].
For the P. vivax isolates from Southeast Asia, copy number variation was observed previously in DBP1, ranging from 1 to 2-3 copies using the Sal1 reference genome [19]. In the present study, copy number variation was detected in six erythrocyte binding gene regions including DBP1, and members of the MSP3 and RBP gene families using the PVP01 reference genome. However, we found no evidence of duplications in the RBP gene family in the isolates form Ethiopia. Among the isolates from Ethiopia, the highest and most variable copy number variations were detected in the DBP, with copy numbers ranging from one to as high as five. Likewise, for the MSP3, copy numbers ranging from one to as high as four. Based on the phylogeny, DBP and MSP3 expansion had occurred multiple times as tandem copies. These findings were consistent with earlier studies [19,79] and suggested that gene expansion may play a key role in host cell invasion [86]. Despite a higher malaria transmission intensity in Southeast Asia than in Ethiopia, PvDBP only ranged from 1 to 2-3 copies among the Southeast Asian isolates, much less than that observed among the isolates from Ethiopia. The range of PvDBP copy variation may not be related to transmission intensity but host-parasite interaction during the invasion process. Given that Duffy phenotype is more diverse among the the human populations in Ethiopia than in Asia [87], it is possible that P. vivax expands DBP to multiple copies to enhance binding affinity and invasion to different host phenotype [79]. These expanded gene copies shared identical nucleotide sequences, suggesting rapid and recent duplications within an isolate [79,88]. Our phylogenetic analyses indicated that P. vivax with high-order DBP copies evolved from isolates with a single-copy DBP multiple times independently at different geographic locations. Isolates with high-order PvDBP copies may confer better fitness and favored by selection, resulting in a higher frequency in isolates of P. vivax from Ethiopia. Moreover, gene flow among geographical locations may allow further spread and a broad distribution of high-order PvDBP strains. For all other putative erythrocyte binding genes, only a single copy was detected among all samples. A larger sample in future investigations would verify this observation.
In the present study, we identified a panel of 43 putative erythrocyte binding gene candidates based on the information from the literature and multiple databases. We are currently validating the functions for each of these genes by binding assays based on the gene variants observed in the isolates from Ethiopia. Among these 43 putative erythrocyte binding gene candidates, MAEBL was shown to be highly conserved in Plasmodium [89] and had a significant signal for positive selection among the P. vivax samples from Ethiopia, agrees with the findings of a recent study [89]. In Plasmodium berghei, MAEBL is a sporozoite attachment protein that plays a role in binding and infecting the mosquito salivary gland [90]. In Plasmodium falciparum, MAEBL is found in the rhoptries and on the surface of mature merozoites and expresses at the beginning of schizogony [90]. In P. vivax, MAEBL is a conserved antigen expressed in blood stages, as well as in the mosquito midgut and salivary gland sporozoites [90,91]. The MAEBL antigen contains at least 25 predicted B-cell epitopes that are likely to elicit antibody -Fig 7. (A) Bar plot based on ADMIXTURE analyses of SNP variants showing the clustering pattern of samples among study sites in Ethiopia. Two or three genetic clusters were identified based on the cross-validation scores. At K = 2, most of the samples from Badowacho, Hawassa, Arbaminch, and Halaba had predominantly the light-blue cluster, whereas samples from Jimma had the red cluster. At K = 3, samples from Badowacho and Hawassa shared a mix of green and blue clusters; samples from Arbaminch had mostly the green cluster; samples from Halaba had mostly the blue cluster, similar to those from Jimma that had predominantly the blue cluster, though red and green clusters were also observed. (B) The transmission network, created using the StrainHub program, indicated that Arbaminch was the major source of infection in Jimma, Halaba, Badowacho and Hawassa. The greatest extent of transmission (indicated by the boldest arrow) was observed between Arbaminch and Badowacho. Even though Jimma, Badowacho and Halaba are geographically in close proximity, transmission was not intense among these sites.
https://doi.org/10.1371/journal.pntd.0008234.g007 dependent immune responses [92]. Thus, positive selection observed in this gene region among the P. vivax isolates from Ethiopia could be associated with the immunity-mediated selection pressure against blood-stage antigens. Apart from MAEBL, positive selection was also detected in the MSP3 gene among the P. vivax isolates from Ethiopia and may have important implications on the susceptibility of human hosts [93]. In P. vivax, the MSP3 gene and its paralogs such as MSP3-alpha and MSP3-beta on chromosome 10 have been shown to simultaneously express in the blood stage merozoite and are immunogenic [26]. These paralogs may have functionally redundant roles in determining antigenicity [26]. Extensive polymorphisms have been reported throughout the gene family, likely through frequent recombination and gene conversion between the PvMSP3 paralogs [26,94]. The central low complexity regions of PvMSP3, where indels and high level of polymorphisms were observed, are highly immunogenic compared to the more conserved N-C terminals [93,95]. Thus, selection for SNP variants in PvMSP3 and significant expansion of the gene family in the P. vivax isolates from Ethiopia may relate to the increased capability of the red blood cell invasion process and immune evasion [96,97], though no significant selection was detected in other geographical isolates [19,89]. While DBP1 had the highest and most diverse copy number variation, no significant signal of positive selection was detected, consistent with the findings in P. vivax isolates from from Western Cambodia, Western Thailand, and Papua Province in Indonesia [19,20]. A number of antimalarial drug resistance genes including the chloroquine resistance transporter (CRT-O) gene, the dihydropteroate synthase (DHPS) gene, and dihydrofolate reductase-thymidylate synthase (DHFR) gene were previously reported with positive selection [20,89], but we did not detect such in this study. Sequences of PvCRT-O and PvMDR1 have been shown to be highly conserved in P. vivax isolates from Ethiopia [53]. Broader samples are needed to further examine the role of these genes in chloroquine resistance. It is noteworthy that the calculation of integrated haplotype scores and the accuracy of phasing genotypes using BEAGLE were dependent on the levels of linkage disequilibrium of the whole genomes. The higher the levels of linkage disequilibrium, the more accurate are the phased genotypes and thus the iHS score. Pearson et al. [19] found that P. vivax experienced drops in linkage disequilibrium after correcting for population structure and other confounders. Linkage disequilibrium of P. vivax genomes has been previously shown to be associated with the rate of genetic recombination and transmission intensity [98][99][100]. In high transmission sites of Papua New Guinea and the Solomon Islands, no identical haplotypes and no significant multilocus LD were observed, indicating limited inbreeding and random associations between alleles in the parasite populations [101,102]. However, when transmission intensity declined, similar haplotypes and significant LD were observed possibly due to self-fertilization, inbreeding and/or recombination of similar parasite strains [98]. Multilocus LD is significantly associated with the genetic relatedness of the parasite strains [103], but inversely associated with the proportion of polyclonal infections [104]. In Southwestern Ethiopia, malaria transmission ranged from low to moderate, and LD levels varied markedly among the study sites [53,105]. To address this limitation in BEAGLE, all genes that were detected with positive selection in BEAGLE were further analyzed with PAML for verification. Future study should include broad samples to thoroughly investigate selection pressure at the population level and the function significance of polymorphisms in the MSP3 genes.
Previous studies have shown high levels of genetic diversity among P. vivax isolates from endemic countries, such as Papua New Guinea, Cambodia, and Myanmar [16,106,107]. Such a diversity was directly related to high transmission intensity and/or frequent gene exchange between parasite populations via human movement [4,12,13,53]. For example, previous studies using microsatellites have demonstrated a consistently high level of intra-population diversity (H E = 0.83) but low between-population differentiation (F ST ranged from 0.001-0.1] in broader regions of Ethiopia [53,105]. High heterozygosity was also observed in P. vivax populations from Qatar, India, and Sudan (average H E = 0.78; 62), with only slight differentiation from P. vivax in Ethiopia (F ST = 0.19) [108]. Frequent inbreeding among dominant clones [98,101] and strong selective pressures especially in relapse infections [19,20,108,109] may also contribute to close genetic relatedness between and within populations. Thus, in this study, it is not surprising to detect a high level of parasite gene flow among the study sites at a small geographical scale, despite the limited number of samples. In the present study, we successfully employed a transmission network model to identify transmission paths, as well as the source and sink of infections in the region, beyond simply indicating genetic relationships. The genetic differences observed between Jimma and the other sites could be explained by the amount of human movements. Sites such as Arbaminch and Hawassa in Ethiopia are well known sight-seeing areas famous of their natural scenery and heritage sites that attract large number of local and foreign tourists every year, whereas Jimma is a local residential area where fewer people from outside visit.
To conclude, this study elaborates the genomic features of P. vivax in Ethiopia, particularly focusing polymorphisms in erythrocyte binding genes that potentially play a key role in local parasite invasion, a critical question given the mixed Duffy positive and negative populations of Ethiopia. The findings provided baseline information on the genomic variability of P. vivax infections in Ethiopia and allowed us to compare the genomic variants of P. vivax between Duffy-positive and Duffy-negative individuals as the next step of our ongoing investigation. Further, we are in progress of developing a panel of informative SNP markers and using them on larger sample sets to track transmission at a micro-geographical scale.
Supporting information S1