Neutral Polymorphisms in Putative Housekeeping Genes and Tandem Repeats Unravels the Population Genetics and Evolutionary History of Plasmodium vivax in India

The evolutionary history and age of Plasmodium vivax has been inferred as both recent and ancient by several studies, mainly using mitochondrial genome diversity. Here we address the age of P. vivax on the Indian subcontinent using selectively neutral housekeeping genes and tandem repeat loci. Analysis of ten housekeeping genes revealed a substantial number of SNPs (n = 75) from 100 P. vivax isolates collected from five geographical regions of India. Neutrality tests showed a majority of the housekeeping genes were selectively neutral, confirming the suitability of housekeeping genes for inferring the evolutionary history of P. vivax. In addition, a genetic differentiation test using housekeeping gene polymorphism data showed a lack of geographical structuring between the five regions of India. The coalescence analysis of the time to the most recent common ancestor estimate yielded an ancient TMRCA (232,228 to 303,030 years) and long-term population history (79,235 to 104,008) of extant P. vivax on the Indian subcontinent. Analysis of 18 tandem repeat loci polymorphisms showed substantial allelic diversity and heterozygosity per locus, and analysis of potential bottlenecks revealed the signature of a stable P. vivax population, further corroborating our ancient age estimates. For the first time we report a comparable evolutionary history of P. vivax inferred by nuclear genetic markers (putative housekeeping genes) to that inferred from mitochondrial genome diversity.


Introduction
Plasmodium vivax is the most prevalent malaria species outside Africa, causing widespread morbidity that can be severe and fatal [1,2,3,4,5,6]. The fixation of the Duffy negativity trait (P. vivax resistance factor) in tropical African populations restricts infection by this parasite there [7]. P. vivax is distantly related to the more virulent Plasmodium falciparum, as inferred from mitochondrial and nuclear DNA variation [8,9,10,11]; the former appears to have evolved from a monkey malaria parasite between 20-35 million years ago via host switching [12]. Various studies on the age and origin of extant P. vivax strongly support it as an ancient human malaria parasite that evolved in Southeast Asia [12,13,14,15].
Single nucleotide polymorphisms (SNPs) have been used by several researchers to elucidate the population history of P. falciparum and P. vivax [12,16,17,18]. These studies established the importance of using selectively neutral loci for determining the molecular age, origin, and evolutionary history of the parasites, assuming a ''molecular clock'' hypothesis. Many population studies have exploited polymorphisms in two kinds of neutral loci: SNPs in putative housekeeping genes, and repeat polymorphisms in tandem repeats such as microsatellites [19,20,21]. Genome sequencing of P. vivax has revealed a higher SNP density [22] and fewer microsatellite markers than in P. falciparum [23,24].
India contributes about 78% of the total malaria cases in South Asia, and P. vivax accounts for 50-55% of this [25]. The paucity of information about the age and evolutionary history of P. vivax in Indian populations [15,26] makes such investigations highly relevant. In this paper, we use an evolutionary genetics approach to unravel the evolutionary history of P. vivax on the Indian subcontinent. We have developed putative housekeeping gene, microsatellite, and minisatellite markers for investigating the population and evolutionary history of P. vivax on the Indian subcontinent.

Identification of putative housekeeping genes
Putative housekeeping genes were identified from the P. vivax genome database PlasmoDB (http://www.plasmodb.org). The major parameters used for selecting housekeeping genes from PlasmoDB were constitutive expression of gene, presence of orthologs in other malaria parasites, and an assigned metabolic function. For the detection of SNPs in putative housekeeping genes, both exons and introns from each gene were selected. PCR primers were manually designed and the location of each primer is shown in Figure 1.

Plasmodium vivax isolates and DNA extraction
We analyzed 100 P. vivax field isolates collected between 2003-2006 from Delhi, Nadiad (Gujarat), Panna (Madhya Pradesh), Chennai (Tamil Nadu), and Kamrup (Assam), a total of twenty samples from each site ( Figure S1). Human populations at these study sites do not have the Duffy negative trait [27]. Other details about the study sites such as parasite and vector species prevalence and disease transmission patterns are given in the Text S1 and reported elsewhere [28,29]. DNA extraction was as described in reference [28,29].

Ethics statement
The ethics committee of the National Institute of Malaria Research approved the study protocol. All subjects provided informed consent, with children providing consent via a parent or guardian.
PCR amplification, sequencing, and sequence analysis PCR amplification reactions were carried out in a final volume of 20.0 ml that included 1-2 ml (,3-5 ng) template DNA, 10 pM for each primer, and 26Master Mix (Promega). PCR primers and the protocols used for amplification of the housekeeping genes are given in Table S1. PCR products were purified and sequenced commercially (Macrogen Inc, Seoul, Korea, http://dna. macrogen.com). For each gene, both forward and reverse primers were used in DNA sequencing. DNA Lasergene software (DNA Star Inc., USA) was used for editing raw DNA sequences (EditSeq Module), and sequences alignment (Clustal W module). Each mutation observed in a sample was confirmed by both forward and reverse sequences.

Single clone infection typing
As multi-clone isolates could hamper correct genotyping and lead to over-estimating genetic diversity in a multi-locus genotyping study, Pvmsp-3a PCR-RFLP analysis was used to identify single-and multi-clone infections [30]; only single clone samples (n = 100) were used for genotyping.

Tandem repeat genotyping
Tandem Repeat Finder version 4.00 [31] was used to identify microsatellites from the P. vivax genome. The characteristics of microsatellites are listed in Table S2. Each forward primer was modified with fluorescence dye, 6FAM, TET and HEX for accurate microsatellite allele sizing. Amplified PCR products were multiplexed with two different dyes and run on a DNA sequencer (ABI 3730: Applied Biosystems Inc. USA). Peak Scanner (Applied

Author Summary
Plasmodium vivax is the most prevalent human malaria parasite outside Africa, responsible for significant morbidity, although it is commonly referred to as a benign malaria agent with respect to P. falciparum. Inferring the evolutionary history of an infectious agent provides important information for designing control measures. In this study, we developed a panel of novel molecular markers including housekeeping genes, minisatellites and microsatellites to infer the evolutionary history of P. vivax on the Indian subcontinent. The panel of markers uncovered an ancient and long evolutionary history and a stable population structure of P. vivax in India.
Biosystems, Foster City, CA) was used for microsatellite allele sizing.

Measures of genetic diversity and neutrality
Genetic diversity was measured in the P. vivax population using parameters including nucleotide diversity (p), haplotype diversity (Hd) and average number of nucleotide differences (K) using DnaSP version 4.10 [32]. Microsatellite genetic diversity was measured by calculating expected heterozygosity (He) for each locus using MicroSatellite Analyzer (MSA) version 4.00 [33]. Expected heterozygosity (He) where n is the number of isolates analyzed and pi is the frequency of the i th allele in the population.

Genetic differentiation test
We estimated genetic differentiation between five geographical populations of P. vivax in the Indian subcontinent using DnaSP version 4.10 [32]. This software calculated chi-square (x 2 ) statistics of housekeeping gene haplotype frequencies between populations [34]. The significant P-value (,0.05) of x 2 statistics rejects null hypothesis. The null hypothesis assumes that all populations are genetically undifferentiated.

Test of neutrality
We calculated D statistics [35,36] and ratio of d N /d S to detect signature of selection at putative housekeeping genes using DnaSP version 4.10 [32]. The number of segregating sites was used for inferring D test statistics. To establish the nature of selection at codons of putative housekeeping genes, we assessed the ratio of the rates of non-synonymous and synonymous substitutions.
Estimation of divergence, TMRCA, and effective population size The divergence time of human and monkey has been deduced as 30-35 million years ago (mya) [37,38] which is consistent with 11-41 mya divergence time between P. vivax and nonhuman primate malarias [8]. We presumed a 23-30 mya divergence time between P. vivax and P. knowlesi that is coincident with their host divergence time [37,38]; this divergence time was used in the estimation of mutation rate and synonymous divergence rate by comparing orthologous putative housekeeping genes of P. vivax and P. knowlesi. In addition, the recent divergence estimates between P. vivax and P. knowlesi (2 to 3 mya and 3.8 to 6.3 mya) was assumed on the basis of radiation of Asian macaques [13,39]. These divergence estimates were also used for the calculation of the above molecular rates.
We estimated the Time to Most Recent Common Ancestor (TMRCA) and effective population size using SNPs present in putatively selectively neutral housekeeping genes. The TMRCA of P. vivax was determined by the equation d S = 2 m s 6Tw, where d S is synonymous substitution rate at synonymous site, m s is synonymous mutation rate per site per year and Tw is TMRCA [18]. Synonymous mutation rate was determined by the equation Ks = 2 m s 6Tb+d S [18], where Ks is the divergence at synonymous sites between species (P. vivax and P. knowlesi), d S is the nucleotide diversity (polymorphism) within species (P. vivax), m s is synonymous mutation rate per generation and Tb is time of the divergence between species (P. vivax and P. knowlesi). Effective population size of extant P. vivax was determined by the equation h = 4Nem (for diploid organisms) and h = 2Nem (for haploid and mitochondrial genomes), where h (theta) is the mutation parameter, Ne the effective population size, and m is the mutation rate per site per year.  Bottleneck/stable population detection Heterozygosity deficiency and mode shift analyses were used to detect evidence of a recent population bottleneck using BOTTLENECK software 1.2.02 [40]. In brief, a recently bottlenecked population would show higher observed gene diversity than the expected equilibrium gene diversity (Heq) which was computed from the observed number of alleles (k), under the assumption of a constant-size (equilibrium) population [41]. A two-phase mutation model (TPM) was used for heterozygosity deficiency analysis at mini-and microsatellite loci. TPM is an intermediate between the stepwise mutation model (SMM) and the infinite allele model (IAM), mostly consisting of one-step mutations having a small percentage (5-10%) of multi-step changes, as per BOTTLENECK software [41]. The allele frequency distribution analysis (mode shift analysis) showed whether it was approximately L-shaped (as expected under mutation-drift equilibrium) or not (recent bottlenecks provoke a mode shift). Bottleneck analysis can be performed using tandem repeat polymorphism data; the minisatellite polymorphism data being used in this study has been taken from earlier published work [28]. We also used network analysis of tandem repeat polymorphisms to detect possible signatures of bottlenecked or stable populations [42]. Two minisatellites located on a single chromosome (Chr 6), were used to construct reduced-median (RM) network of haplotypes using Network software [42]. In brief, a star-like network indicates a recent population expansion, whereas the absence of a star-like network indicates an ancient population expansion.

Estimation of expected and observed pairwise differences
The estimation of expected and observed frequencies of pairwise differences at nucleotide sites provides a signature of whether a population is stable or has recently experienced a population bottleneck [36]. In summary, an L-shaped curve between observed and expected allele frequencies spectrum indicates a stable population whereas a non L-shaped curve

Screening the genome for SNPs and tandem repeats
We screened the P. vivax Salvador 1 reference genome and identified ten putative housekeeping genes, and ten minisatellites and eight microsatellites, to infer the evolutionary history of this parasite in the Indian subcontinent. The selected housekeeping genes are exonuclease domain (ed), adenylate cyclase (ac), serine/threonine protein kinase (stpk), acyl carrier protein (acp), ribosomal protein l35e (l35e), 60S ribosomal protein l34a (l34 a), DNA gyrase (dg), calciumdependent protein kinase (cdpk), enolase and RNA polymerase-II (Rpol). The structure of the housekeeping genes (arrangement of introns and exons), gene identifier and fragments used for SNP identification are given in Figure 1 and Table 1. Approximately 400 to 600 bp of intron and exon sequence were used to identify SNPs. However, introns were not present in two housekeeping genes (ac and ed), therefore, partial exon sequence of these were used. The selected minisatellites and microsatellites are located at six different chromosomes and their features are listed in Table S2. Among the eight microsatellites, a single microsatellite (Gomez_1) was reported earlier by Gomez et al [43].

Putative housekeeping gene polymorphisms
Ten housekeeping genes were successfully amplified and sequenced to at least two-fold coverage. Analysis of the sequences revealed size polymorphisms in four of the housekeeping genes and two to six size variants were observed. These size variants were due to indels (l35e) or tandem repeat variation (l34a, DNA gyrase) in the gene's repeat region ( Figure S2). Analyzing 9,298 nucleotide sites revealed a substantial number of SNPs (n = 75) in both coding and non-coding regions ( Table 1). The number of synonymous, non-synonymous and non-coding SNPs observed, are listed in Table 1. On average, putative housekeeping genes had a low level of nucleotide (p = 0.00207660.00059) and haplotype diversity (Hd = 0.58946 0.108) in P. vivax field isolates. Of the 75 SNPs identified, 55 were putatively selectively neutral (13 synonymous and 42 non-coding) and 20 were non-synonymous ( Table 1). Sequencing of the Pvcdpk gene revealed presence of tandem repeats in the central region of gene, which caused difficulty in aligning the sequences correctly, and was therefore excluded from further analysis.

Putative housekeeping genes are selectively neutral
The neutrality of housekeeping genes was inferred from the results of three tests. 1) The distribution of SNPs among exons (n = 33) and introns (n = 42) of putative housekeeping genes seems to be random (x 2 = 0.572, p = 0.449) suggesting a lack of functional constraint on exons. 2) By Tajima's D test, none of the putative housekeeping genes showed signs of significant deviation from neutrality ( Table 2); however, Fu and Li's D test showed biased singleton mutation distribution in two of the nine putative housekeeping genes; these two genes were enolase (D = 22.920 p = ,0.004) and l35e (D = 23.617, p = ,0.004). 3) The d N /d S ratio showed a signature of purifying selection for two putative housekeeping genes (l35e and dg), whereas mutations in the remaining genes were selectively neutral ( Table 2). Further, none of the housekeeping genes showed significant departure from neutrality by the above three tests conducted together. Thus, these tests suggest that these housekeeping genes are primarily evolving in a neutral fashion and therefore can be employed as genetic markers for inferring population and evolutionary history of P. vivax.

Lack of geographical structuring between populations
To understand whether P. vivax isolates are structured in populations, we undertook a genetic differentiation test using individual housekeeping genes. Our null hypothesis is that populations of P. vivax in the Indian subcontinent are genetically un-differentiated. Analysis of seven of the eight housekeeping genes accepted this hypothesis, while only DNA gyrase rejected the null hypothesis ( Table 3). Since DNA gyrase does not appear to be neutral by neutrality tests ( Table 2), it may not be ideal for measuring genetic differentiation between P. vivax populations. We confirmed the suitability of using chisquare statistics for genetic differentiation by determining the critical value of haplotype diversity (,0.95) [34]. The haplotype diversity measured at each housekeeping gene was lesser than the critical value (Hd = 0.169-0.864) ( Table 3), confirming suitability of the use of chi-square statistics for genetic differentiation. Based on the seven selectively neutral housekeeping genes, it appears that different populations of P. vivax are not geographically structured in India. Thus, P. vivax isolates collected from different geographical regions of the Indian subcontinent will not have an adverse impact on the pooled analyses of TMRCA, effective population size, and population bottleneck. Estimation of synonymous divergence rate and mutation rate Since P. vivax is a close relative of P. knowlesi and the genome sequence of the latter has been determined [44], orthologous genes from P. knowlesi could be used to determine the genetic distance (divergence) between P. knowlesi and P. vivax and the pattern of evolutionary pressure (positive or negative selection) on these putative housekeeping genes. An approximately tenfold higher synonymous divergence rate over the non-synonymous divergence rate was observed between P. vivax and P. knowlesi ( Table 2), suggesting that the putative housekeeping genes are evolving strictly under purifying selection. The average synonymous divergence and mutation rate for P. vivax was determined under various divergence time scenarios between the P. knowlesi and P. vivax. The estimated rates of synonymous divergence and mutation are given in Table 4. Based on this range, the synonymous divergence rate (per site per year) for the putative housekeeping genes was calculated as 8.61610 29 (m sa ) and 6.60610 29 (m sb ) under 23 mya and 30 mya divergence scenarios, respectively. Similarly, the mutation rates (per site per year) under the 23 mya and 30 mya divergence scenarios were determined to be 6.51610 29 (m a ) and 4.99610 29 (m b ), respectively. In contrast, few studies suggested relatively recent divergence time of P. vivax viz 2-3 mya [10] and 3.8-7 mya [13] and based on these divergence time mutation rates were also estimated. Under recent divergence time scenario, synonymous divergence rate (m s ) and mutation rate (m) are calculated as above, and listed in Table 4.

TMRCA and effective population size (Ne) estimates
We determined a range of TMRCA for extant P. vivax in India that indicate for recent (20,202 to 30,303 years) or ancient (232,288 to 303,030 years) existence, based upon assumptions of recent or ancient divergence (Table 5). Similarly, on the basis of higher and lower mutation rate estimates, the effective population size estimate of extant P. vivax was obtained as recent (Ne = 6,929 to 10,400) and long-term (Ne = 79,235 to 104,008), respectively ( Table 4).

Plasmodium vivax populations are stable
We tested the housekeeping gene polymorphism data to see if there were detectable differences between the observed and expected allele frequency, which might indicate an unstable population structure of P. vivax in India. Assuming constant population equilibrium, we determined that the observed allele frequency spectrum was similar to the expected one ( Figure 2). In addition, we calculated the pairwise difference analysis for all samples and individual populations (Figure S3), and also found the observed allele frequency spectrum followed the expected frequency. Thus this analysis indicates the stable nature of P. vivax populations in the Indian subcontinent.

Tandem repeat variability and population bottleneck
Seven microsatellites, three from one chromosome (Chr 6), one each from four chromosomes (Chr 2, Chr 5, Chr 7 and Chr10), were identified from the P. vivax genome sequence, and a subset of isolates (n = 40) was tested by microsatellite analysis. A substantial number of alleles (range 6-13, AE = 8.6261.16) and high expected heterozygosity (He = 0.65 to 0.90, AE = 0.78960.039) were observed for each locus. Using data that we had previously generated for ten minisatellites on the same sample set used here [28], bottleneck analysis (heterozygote deficiency and allele frequency mode shift tests) was undertaken to detect whether the extant P. vivax population in the Indian subcontinent reflects a *Mu et al. [15] #Cornejo et al.  Evolutionary History of Plasmodium vivax in India signature of a recent population bottleneck or of a more ancient population. A majority of loci (80%) showed heterozygosity deficiency ( Table 6), suggesting that they are in expected genetic diversity equilibrium. Similarly, mode shift analysis revealed a strictly L-shaped allele frequency distribution ( Figure 3). The high heterozygosity deficiency and L-shaped allele frequency distribution at minisatellites suggests that the Indian P. vivax population has not undergone a recent population bottleneck. This supports our ancient age estimates of P. vivax inferred from putative housekeeping genes.

Network analysis of tandem repeat loci
We constructed a reduced-median (RM) network of tandem repeat-based haplotypes to understand whether P. vivax populations in India expanded recently (which would produce a star-like network). A RM haplotype network derived from more than two tandem repeat loci produces a highly complex network, therefore, we limited our analysis to two tandem repeat loci on P. vivax Chr 6. No star-like network of haplotypes was produced (Figure 4), suggesting an ancient population expansion of P. vivax has occurred in the Indian subcontinent.

Discussion
This is the first study concerning P. vivax evolutionary history on the Indian subcontinent using SNPs in putative housekeeping genes and minisatellite and microsatellite markers. In this study we identified 1) polymorphisms in Indian P. vivax populations at neutral genetic loci; 2) neutral housekeeping genes suitable for inferring the evolutionary history of P. vivax in India; 3) no geographical structuring of P. vivax populations; and 4) an ancient and stable population of P. vivax on the Indian subcontinent.
The high degree of genetic polymorphism observed in tandem repeat loci agrees with earlier studies which revealed tremendous genetic polymorphism among global P. vivax isolates [22,43,45,46,47,48]. This study successfully identified neutral genetic variation in several genetic loci, and this renders them highly useful as molecular markers for population structure studies. The high density of putatively neutral SNPs observed in housekeeping genes among Indian P. vivax isolates is in accordance with observations of Feng et al. [22], suggesting SNPs in P. vivax isolates are comparatively more common than in P. falciparum [17,22]. Recently this was confirmed by deep sequencing that showed a higher genetic variability in P. vivax than P. falciparum [49]. Our neutrality test results imply that most of the housekeeping genes are free from directed selection pressure. Neutral evolution of the housekeeping genes described here argues for their utility as molecular markers for understanding demographic events and population history of P. vivax. We also tested whether pooling the P. vivax isolates collected from five geographically disparate populations might have adversely impacted the  Evolutionary History of Plasmodium vivax in India analyses of TMRCA and effective population size estimations, due to geographic structuring of subpopulations. However, we found no evidence for significant divergence between the populations, suggest that pooling of samples will not have an adverse impact on the analyses of TMRCA and effective population size. The absence of population structure between these five populations was also confirmed in an earlier study on the basis of genetic distance data derived from ten minisatellite loci [28].
Using the neutral polymorphisms of several housekeeping genes, we deduced that the MRCA of extant P. vivax was present on the Indian subcontinent about 232,228 to 303,030 years ago. This ancient TMRCA is strongly supported by long-term effective P. vivax population size (Ne) estimates and stable population indexes. The stable population index of P. vivax in Indian populations was established earlier using other neutral genetic loci [26] that also supports our older age estimate of this species. This ancient evolutionary history of extant P. vivax in India is similar to the molecular age inferred from mitochondrial genome diversity, i.e., 162,400-464,600 years [14,15,50]. Our TMRCA estimate strongly supports the hypothesis that P. vivax was a hominoid primate parasite before it became a human parasite by means of a host switch [12,15,50] and that it has evolved in Southeast Asia. The alternative hypothesis claiming an African origin for P. vivax is based on the fixation of the Duffy negative trait in the black African population. However, this alternate hypothesis is fraught with inconsistencies. For example, the Duffy negative trait arose around 100,000 years ago [51]; this time scale is shorter than the P. vivax TMRCA estimated in the present study and others [14,15,50]. The Duffy receptor is used for invasion by many pathogens including bacteria [52] and viruses [53,54,55]; thus, P. vivax is not necessarily the only evolutionary force for the fixation of this trait in the West African native population. Finally, P. vivax infection of Duffy negative people [56,57,58] suggests the existence of alternate invasion mechanisms, such as occur in P. falciparum [59,60,61].
In contrast, a much younger age estimate of P. vivax was determined under a recent divergence model of P. vivax and P. cynomolgi [10] coincident with the radiation of Asian macaques 2-3 mya [39]. This assumption implies that the MRCA of extant P. vivax dates to just 20,202-30,303 years ago. It is of interest to mention that under this recent divergence model [10], P. vivax shows a ten-fold higher mutation rate than P. falciparum [17], and it is unlikely that such a high mutation rate difference exists between related species. The radiation of Asian macaques was inferred from analysis of a mitochondrial gene, which may be evolving at a different rate than nuclear genes. Combined, our population genetic analyses such as pairwise difference, allele frequency, reduced-median network of haplotypes, and heterozygote deficiency conducted in the present study do not reflect the signature of a recent population expansion or bottlenecked population of P. vivax in the Indian subcontinent. Therefore, an ancient divergence time (23-30 mya) of P. vivax and P. knowlesi, seems more plausible.
In conclusion, our findings reveal that P. vivax isolates have an ancient evolutionary history in the Indian subcontinent. For the first time we report that neutral nuclear genome markers (housekeeping genes) displayed an evolutionary history of P. vivax similar to that inferred from mitochondrial genome diversity.  Text S1 Details of study sites. (DOC) Institutes of Health/Fogarty International Center, Global Infectious Disease training grant (D43 TW007884). SKP is an ICMR-Postdoctoral Fellow. The authors thank Dr. Steven Sullivan for proof reading, Dr. Simon Kang'a for technical support, and NIMR scientists, staff in the Molecular Biology Division, and at NIMR field units for their support and cooperation during the study. The authors would like to dedicate this manuscript to the memory of Dr. Hema Joshi (deceased). The content is solely the responsibility of the authors and does not necessarily represent the official views of the Fogarty International Center or the National Institutes of Health.