Plasmodium vivax-like genome sequences shed new insights into Plasmodium vivax biology and evolution

Although Plasmodium vivax is responsible for the majority of malaria infections outside Africa, little is known about its evolution and pathway to humans. Its closest genetic relative, P. vivax-like, was discovered in African great apes and is hypothesized to have given rise to P. vivax in humans. To unravel the evolutionary history and adaptation of P. vivax to different host environments, we generated using long- and short-read sequence technologies 2 new P. vivax-like reference genomes and 9 additional P. vivax-like genotypes. Analyses show that the genomes of P. vivax and P. vivax-like are highly similar and colinear within the core regions. Phylogenetic analyses clearly show that P. vivax-like parasites form a genetically distinct clade from P. vivax. Concerning the relative divergence dating, we show that the evolution of P. vivax in humans did not occur at the same time as the other agents of human malaria, thus suggesting that the transfer of Plasmodium parasites to humans happened several times independently over the history of the Homo genus. We further identify several key genes that exhibit signatures of positive selection exclusively in the human P. vivax parasites. Two of these genes have been identified to also be under positive selection in the other main human malaria agent, P. falciparum, thus suggesting their key role in the evolution of the ability of these parasites to infect humans or their anthropophilic vectors. Finally, we demonstrate that some gene families important for red blood cell (RBC) invasion (a key step of the life cycle of these parasites) have undergone lineage-specific evolution in the human parasite (e.g., reticulocyte-binding proteins [RBPs]).

4 malariae--like, its ape relative. Similar to other ape--infective Plasmodium species, P. vivax--like exhibits far higher levels of genetic diversity than its human--infective relative, P. vivax. Finally, our genome--wide analyses provide new insights into the adaptive evolution of P. vivax. Indeed, we identify several key genes that exhibit signatures of positive selection exclusively in P. vivax, and show that some gene families important for red blood cell invasion have undergone species--specific evolution in the human parasite.

Genome Assemblies
Eleven P. vivax--like genotypes were obtained from two different kinds of samples: ten infected chimpanzee blood samples collected during successive routine sanitary controls of chimpanzees living in the Park of La Lékédi (a sanctuary in Gabon) and one infected Anopheles mosquito collected during an entomological survey realized in the same park 11 (Supplementary Table 1). For blood samples, white blood cells were depleted using the CF11 method 12 Table 1). Four samples containing P. gaboni or P.
In order to obtain the P. vivax--like genotypes, sequencing reads were extracted based on their similarity to the reference genome sequence of P. vivax, PvP01 15 . Sequencing reads from two samples, one obtained using Illumina sequencing, Pvl01, and another using PacBio technology, Pvl06, were used to perform de novo genome assemblies and annotated to produce reference genomes for P. vivax--like (Supplementary Table 2). Of the two assemblies, Pvl01 is of considerably higher quality with 4,570 orthologues to the PvP01 reference genome compared to 2,925 for Pvl06 (Table 1). Both assemblies consist of 14 supercontigs (corresponding to the 14 P. vivax chromosomes) and, respectively for Pvl01 and Pvl06, 226 and 370 unassigned contigs, comprising a total of 27.5Mb and 18.8Mb in size for Pvl01 and Pvl06 respectively. After annotation with Companion 16 , these two genomes contained 5,532 and 4,953 annotated genes (Table 1). 5 The genome sequences obtained from the other samples were used for SNP calling, and population genetic and phylogenetic analyses.

Gene synteny and gene composition
Comparing the P. vivax--like reference genomes with those of P. vivax (PvP01 and SalI) 2,15 , P. cynomolgi (B strain) 8  Plasmodium genomes studied 18,19 . hosts. This would be in accordance with the fact that the only described transfer of P.
vivax--like to humans was in a Caucasian Duffy positive individual 9 and that no transfers of P. vivax--like were recorded in Central African Duffy negative populations despite the fact that they live in close proximity with infected ape populations 26 . RBP genes encode a merozoite surface protein family present across all Plasmodium species and known to be involved in erythrocyte invasion and host specificity 22 . Comparison of the organization and characteristics of this gene family between P. vivax, P. vivax--like, P. knowlesi and P. cynomolgi ( Table  2), reveals that: 1) all gene classes (RBP1, RBP2 and RBP3) are ancestral to the divergence of these species; 2) the expansion of RBP2, a class of genes that confer the ability to infect reticulocytes (immature red blood cells), was likely ancestral to the P. vivax/P. vivax--like lineage and 3) RBP3, which is a class of genes supposed to confer the ability to infect specifically normocytes (mature red blood cells), are functional in all species except in P. vivax (where the gene is pseudogenised), suggesting that P. vivax specifically lost the ability to infect this category of erythrocytes during its adaptation to humans ( Figure 1).

Phylogenetic relationships to other Plasmodium species and divergence time
Conservation of the gene content between P. vivax--like with the other primate-- infective Plasmodium species has enabled us to reconstruct with confidence the relationships between the different species and to estimate the age of the different speciation events. This analysis confirmed the position of P. vivax--like as the closest sister lineage of P. vivax ( Figure 2). Regarding the divergence time, with the fix point considered as the relative distance of the split of the speciation of the two P. ovale genomes and the P. malariae--like and P. malariae 14 , relative split times between Plasmodium species were estimated using the method of Silva et al. 27 . Assuming consistent mutation rates and generation times across the branches, we observed that the time of the split between P. vivax and P. vivax-like is about three times more recently than than the split between P. ovale wallikeri and P. ovale curtisi and 1.5 times earlier than Plasmodium malariae and Plasmodium malariae--like. It has to be noticed that the split between these two last species was estimated to be five times earlier than the one between P. ovale wallikeri and P. ovale curtesi, which is consistent with the study of Rutledge et al. 14 . However, it has to be confirmed with other dating methods, because the GC content could bias this estimation.
Indeed, the models used in our study assume a strict molecular clock, which would not apply to all Plasmodium species, specifically for P. falciparum because of its extreme GC content in comparison to other Plasmodium species.

Relationships to worldwide human P. vivax isolates
To analyse the relationship between our 11 P. vivax--like isolates with human P.
All sequencing reads were aligned against the PvP01 reference genome 15  One explanation for this difference with previous results could be due to a phenomenon called Incomplete Lineage Sorting (ILS). ILS is the discordance observed between some gene trees and the species or population tree due to the coalescence of gene copies in an ancestral species or population 29 . Such a phenomenon is often observed when species or population divergence is recent, which is the case for P. vivax/P. vivax--like 30,31 . ILS may thus result in the wrong conclusion of P. vivax and P.
vivax--like populations being intermixed and P. vivax diversity being included in the diversity of P. vivax--like. In our study, the use of a lot more genetic information localized throughout the genome, both in genic and intergenic regions, allows us to reduce this effect of ILS and reflects a more accurate picture of the genetic relationship between the different parasite species. Another explanation to these contradictory results would rely on the analysis of P. vivax--like samples collected only in Gabon or Cameroon, which limit the access to the full genetic diversity of these parasites. Clearly for now, the origin, the direction of the transfer and the evolutionary history of these parasites are still unclear and need addition of more P. vivax--like samples from other locations to be elucidated.
Our results also show that P. vivax--like is composed of two distinct lineages: the one including the two reference genomes (Pvl01 and Pvl06) and seven other isolates that will hereafter be referred as P. vivax--like 1 and another one including two isolates

P. vivax specific adaptive evolution
Comparison of the P. vivax genome to its closest sister lineage (P. vivax--like) and to the other primate Plasmodium provides a unique opportunity to identify P. vivax specific adaptations to humans. We applied a branch--site test of positive selection to detect events of positive selection that exclusively occurred in the P. vivax lineage.
Within the reference genome P. vivax--like (Pvl01), 418 genes exhibited significant signals of positive selection (Supplementary Table 3). In the human P. vivax genome PvP01, the test allowed the identification of 255 genes showing significant signals of positive selection (Supplementary  Table  4). Among these genes presenting a significant dN/dS ratio, 71 were shared between P. vivax and P. vivax--like, including 56 encoding for proteins with unknown function, and 15 encode proteins that are involved either in energy metabolism regulation (n = 9), in chromatid segregation (n = 2) or cellular--based movement (n = 4). We then took into consideration the genes detected under positive selection in P.
falciparum 13 and compared them to those obtain in P. vivax. We identified of a subset of

Conclusion.
In summary, we assembled the first P. vivax--like reference genomes, the closest sister clade to human P. vivax, which is an indispensable step in the development of a model system for a better understanding of this enigmatic species. We validated that P.
vivax--like parasites form a genetically distinct clade from P. vivax. Concerning the relative divergence dating, we estimated that the divergence between both species occurred three times more recently than the split between P. ovale wallikeri and P. ovale  Table 1 malariae--like 14 were identified using OrthoMCL v2.09 48,49 . From those, we extracted different sets of one--to--one orthologues for the subsequent analyses: a set of 4,056 genes that included the one--to--one orthologues among the four restricted species, P.
vivax, P. vivax--like, P. cynomolgi and P. knowlesi, and a set of 2,352 among the 13 Plasmodium species considered here for the interspecies phylogenetic analysis.
Amino acid sequences of the one--to--one orthologues were aligned using MUSCLE 50 .
Prior to aligning codon sequences, we removed the low complexity regions identified on a nucleotide level using dustmasker 51 and then in amino acid sequences using segmasker 52 from ncbi--blast. After MUSCLE alignments 50 , we finally excluded poorly aligned codon regions using Gblocks default parameters 53 .
SNP discovery and annotation. SNPs were called independently for all 11 P. vivax--like and 19 P. vivax samples by first mapping the samples against the P. vivax PvP01 reference genome using SMALT and then calling SNPs using Samtools mpileup v. 0.1.9 (parameters -q 20 --Q 20 --C 50) followed by bcftools (call --c --V indels). SNPs were filtered using VCFTools (----minDP 5 -max--missing 1). malariae--like 14 . From the proteins of the 12 genomes, low complexity regions were excluded with SEG filter, using default parameters 56 . After an all--against--all BLASTp falciparum was also dated based on the P. malariae and P. malariae--like split estimation 14 . However, this will need to be confirmed with other methods, because the GC content could bias this estimation. Indeed, the models used in our study assume a strict molecular clock, which would not apply to all Plasmodium species, specifically for P. falciparum because of its extreme GC content in comparison to other Plasmodium species.
Phylogenetic tree of P. vivax and P. vivax--like strains. We constructed for Figure  3 a maximum--likelihood tree using the filtered variant call set of SNPs limited to the higher allelic frequency genotypes identified within each sample using RAxML and PhyML (using general--time reversible GTR models) 58,60 . Trees were visualized using Geneious software 46 . All approaches showed the same final phylogenetic tree described in the results section.
Genome wide nucleotide diversity. For the P. vivax and P. vivax--like populations, we calculated the genome--wide nucleotide diversity (π) 32 using VCFTools 61 . The nucleotide diversity was compared between P. vivax and P. vivax--like species based on the Wilcoxon--Mann--Whitney non--parametric test. with a free--ratio model of evolution 59 .

Detection of genes under selection. In order to identify genomic regions involved in
Data availability. All sequences are being submitted to the European Nucleotide Archive. The accession numbers of the raw reads and assembly data will be found in Supplementary Table 2. As the assemblies are private, they will be available on request.