Quantitative Seq-LGS – Genome-Wide Identification of Genetic Drivers of Multiple Phenotypes in Malaria Parasites

ABSTRACT Identifying the genetic determinants of phenotypes that impact on disease severity is of fundamental importance for the design of new interventions against malaria. Traditionally, such discovery has relied on labor-intensive approaches that require significant investments of time and resources. By combining Linkage Group Selection (LGS), quantitative whole genome population sequencing and a novel mathematical modeling approach (qSeq-LGS), we simultaneously identified multiple genes underlying two distinct phenotypes, identifying novel alleles for growth rate and strain specific immunity (SSI), while removing the need for traditionally required steps such as cloning, individual progeny phenotyping and marker generation. The detection of novel variants, verified by experimental phenotyping methods, demonstrates the remarkable potential of this approach for the identification of genes controlling selectable phenotypes in malaria and other apicomplexan parasites for which experimental genetic crosses are amenable. Significance Statement This paper describes a powerful and rapid approach to the discovery of genes underlying medically important phenotypes in malaria parasites. This is crucial for the design of new drug and vaccine interventions. The approach bypasses the most time-consuming steps required by traditional genetic linkage studies and combines Mendelian genetics, quantitative deep sequencing technologies, genome analysis and mathematical modeling. We demonstrate that the approach can simultaneously identify multigenic drivers of multiple phenotypes, thus allowing complex genotyping studies to be conducted concomitantly. This methodology will be particularly useful for discovering the genetic basis of medically important phenotypes such as drug resistance and virulence in malaria and other apicomplexan parasites, as well as potentially in any organism undergoing sexual recombination.


INTRODUCTION
Malaria parasite strains are genotypically polymorphic, leading to a diversity of phenotypic characteristics that impact on the severity of the disease they cause. Discovering the genetic bases for such phenotypic traits can inform the design of new drugs and vaccines.
Experimental studies of the genetic basis of particular phenotypes have previously relied on linkage or association mapping. Linkage mapping studies involve the crossing of two inbred (or clonal) parental stains to produce one or more recombinant progenies, from which individual recombinant offspring are isolated and typed for the phenotype under investigation, while in association mapping studies unrelated individuals are sampled and phenotyped (1). Both approaches require the genotyping of individual progeny through the use of genetic markers, and in the case of linkage mapping, this requires the cloning of progeny following the production of a genetic cross. Prior to the advent of high-throughput deep sequencing using second and third generation sequencing technologies, the identification of loci associated with particular phenotypes relied on the development of large numbers of molecular markers through approaches such as restriction fragment length polymorphism (RFLP) (2), amplified fragment length polymorphism (AFLP) (3) or microsatellite markers (4).
Both association mapping and linkage analyses approaches have been adopted to understand the genetic mechanisms behind various phenotypes in malaria (5)(6)(7)(8)(9) and with the application of whole genome sequencing (WGS), the resolution of these methodologies has been dramatically improved, allowing the discovery of selective sweeps as they arise in the field (10).
However, both approaches suffer from drawbacks when working with malaria parasites: linkage mapping requires the cloning of individual recombinant offspring, a process that is both laborious and time-consuming, although the recent combination of genetically engineered parasites and flow cytometry has shown great potential to address the issue (11).
Association studies, on the other hand, require the collection of a large number of individual parasites (usually in the thousands) from diverse geographical origins and over periods of several months or years to produce enough resolution for the detection of selective sweeps. This approach also requires a considerable amount of manpower and resources.
Linkage Group Selection (LGS), like linkage mapping, relies on the generation of genetic crosses, but bypasses the need for extracting and phenotyping individual recombinant clones.
Instead, it relies on quantitative molecular markers to measure allele frequencies in the recombinant progeny and identify loci under selection (12,13). It has been successfully applied in studying strain-specific immunity (SSI) (14, 15), drug resistance (12,16) and growth rate (Pattaradilokrat et al. 2009) in malaria and SSI in Eimeria tenella (17). More recently, LGS also benefited from the advent of quantitative WGS, facilitating the generation of molecular markers (the SNPs identified during WGS) and improving the resolution and accuracy of the approach (18).
An approach similar to the above, applied to yeast, has also shown that complex trait phenotypes, such as heat tolerance, can be analyzed, exploiting the possibility of collecting timeresolved sequence data from such a system (19,20). However, in an experimental malaria system, collection of data is not so straightforward.
LGS also lacks a formal mathematical methodology to distinguish selective sweeps (or "selection valleys") from false positive signals (e.g. sudden shifts in allele frequency within a chromosome due to clonal growth in the cross population, or simply random background noise) or to accurately define the boundaries of regions of the genome under significant selection.
In the present study, LGS has been applied to a genetic cross between two strains of the rodent malaria parasite Plasmodium yoelii that display two discernible phenotypes, namely a difference in growth rate (with strain 17X1.1pp growing faster than strain CU) and the ability to induce protective immunity against homologous challenges with the immunizing strain. A sophisticated mathematical model, built upon methodological developments in the analysis of genetic cross populations (20), was developed to analyze the data.
This modified LGS approach relies on the generation and independent selection of two independent crosses between the same parental clone parasites that differ in the phenotype under investigation. The progeny from both crosses pre-and post-selection are then subjected to high The process starts with the identification of distinct selectable phenotypes in cloned strains of the pathogen population (in this case malaria parasites) and their sequencing, usually from the vertebrate blood stage. (B) Multiple genetic crosses between two cloned strains are produced, in this case inside the mosquito vector. (C) Each individual cross progeny is grown with and without selection pressure(s). Selection pressure will remove those progeny individuals carrying allele(s) associated with sensitivity to the selection pressure(s), while allowing progeny individuals with the resistant allele(s) to survive. DNA is then extracted from the whole, uncloned progeny for sequencing. SNPs distinguishing both parents are used to measure allele frequencies in the selected and unselected progenies. (D) A mathematical model is applied to identify and define selective sweeps. Selective sweep regions are then analyzed in detail to identify potential target mutations underlying the phenotype(s) being studied. Localized, capillary sequencing can be employed to verify or further characterize mutations. (E) Finally, where applicable, transfections studies, allele replacement experiments or other approaches can be carried out to confirm the effect of target mutations.
Applying this approach to crosses between two strains of P. yoelii that induce SSI, and which differ in their growth rates, we were able to identify three genomic regions containing genes controlling both phenotypes. Two genomic selection valleys were identified controlling SSI, and n to s) h RESULTS Characterization of strain specific immunity and growth rate phenotypic differences between CU and 17X1.1pp.
The difference in growth rate between the blood stages of the two clones was followed in vivo for nine days in CBA mice. A likelihood ratio test using general linear mixed models indicated a more pronounced growth rate for 17X1.1pp compared to CU (clone by time interaction term, L = 88.60, df = 21, P < 0.0001, Figure 2A). To verify that the two malaria clones could also be used to generate protective SSI, groups of mice were immunized with 17X1.1pp, CU or mock immunized, prior to challenge with a mixture of the two clones. The relative proportions of the two clones were measured on day four of the infection by real time quantitative PCR targeting the polymorphic msp1 locus (21). A strong, statistically significant SSI was induced by both parasite strains in CBA mice ( Figure 2B). Two kinds of selection pressure were applied in this study: growth rate driven selection and SSI.
Two independent genetic crosses between 17X1.1pp and CU were produced, and both these crosses subjected to immune selection (in which the progeny were grown in mice made immune to either of the two parental clones), and grown in non-immune mice. Progeny were harvested from mice four days after challenge, at which point strain specific immune selection in the immunized mice, and selection of faster growing parasites in the non-immune mice had occurred. Using deep sequencing by Illumina technology, a total of 29,053 high confidence genome-wide SNPs that distinguish the parental strains were produced by read mapping with custom-made Python scripts.
These were filtered using a likelihood ratio test to remove sites where alleles had been erroneously mapped to the wrong genome location. Next a "jump-diffusion" analysis was applied to identify allele frequency changes (Supplementary Table S1 A total of five potential selective sweeps were detected among all the samples ( Table 1). Of these, two were detected in both replicates and across multiple progeny samples ( Figure 3A).  The maximum likelihood location of the selected allele by chromosome genome coordinate. 3 The combined interval was produced by combining the likelihoods calculated for each replica, with selected alleles having the same positions in each replica, ignoring biological noise. although a corresponding increase in CU-allele frequencies was also apparent ( Figure 3A).
The remaining selection valleys (on Chrs VIII and XIII) were only detected in a single sample and in a single replicate ( Table 1) and were thus considered to be non-significant.

Potential target genes within the three main selection valleys.
The selection valley associated with SSI on Chr VIII contains 41 genes of which one, msp1, is a well characterized major antigen of malaria parasites that has been proposed as a vaccine candidate (22) and has been previously linked to SSI in Plasmodium chabaudi (14, 15). No other gene in this selection valley matched the profile of a potential antigen (based on functional description, presence of TM domains and/or signal peptides).
The growth rate associated selection valley on Chr XIII contains 29 genes, including Pyebl, a gene previously implicated in growth rate differences between strains of P. yoelii (24,25).
Finally, the minor selection valley on Chr VII consists of 21 genes with several genes bearing signatures associated with a potential antigenic role.
Characterization of EBL as the major driver of growth rate differences through alleleic replacement.
Due to its location in the selection valley associated with growth rate selection and its previous identification as a gene underlying multiplication rate differences in P. yoelii, Pyebl was selected for further analysis. Sanger capillary sequencing re-confirmed the existence in 17X1.1pp of an amino acid substitution (Cys -> Tyr) at position 351 within region 2 of the gene. When aligned against other P. yoelii strains and other Plasmodium species, this cysteine residue is highly conserved, thus making the substitution that arose in 17X1.1pp unique (Figure 4). Crucially, no other mutations were detected in the coding sequence of the gene, including in region 6, the location of the SNP previously implicated in parasite virulence in other strains of P. yoelii (23).   In order to study the role of the mutation, the Pyebl gene locus of slow growing CU and faster growing 17X1.1pp clones were replaced with the alternative allele, as well as with the wildtype allele. To establish whether the C351Y substitution affected EBL localization, as was shown for the previously described region 6 mutation, IFA was performed. This revealed that, unlike the mutation in region 6 described previously (23), the EBL proteins of 17X1.1pp and CU were both found to be located in the micronemes (Figure 5 and Supplementary Figure S2). Transgenic clones were grown in mice for 10 days alongside wild-type clones. Pair-wise comparisons between transgenic clones showed that allele substitution could switch growth phenotypes in both strains (Figure 6A and 6B). RNA-seq analysis revealed that transfected gene alleles were expressed normally (Supplementary Figure S3). This confirmed the role of the C351Y mutation as underlying the observed growth rate difference. (EBL-351Y) allele produces a significantly increased growth rate in the CU strain (CU-EBL-351C>C vs CU-EBL-351C>Y: p < 0.01, Two-way ANOVA with Tukey post-test correction) that is not significantly different from 17X1.1pp growth rate following transfection with its native allele (17X1.1pp-EBL-351Y>Y vs. CU-EBL-351C>Y: p > 0.05, Two-way ANOVA with Tukey post-test correction). Conversely, transfection with the CU (EBA-351C) allele significantly reduces growth (17X1.1pp-EBL-351Y>Y vs 17X1.1pp-EBL-351Y>C: p < 0.01, Two-way ANOVA with Tukey posttest correction) and produces a phenotype that is not significantly different from CU transfected with its own allele (CU EBL-351C>C vs 17X1.1pp-EBL-351Y>C: p > 0.05, Two-way ANOVA with Tukey post-test correction).

DISCUSSION
The development of LGS has facilitated functional genomic analysis of malaria parasites over the past decade. In particular, it has simplified and accelerated the detection of loci underlying selectable phenotypes such as drug resistance, SSI and growth rate (12,14,24). Here we present a radically modified LGS approach that utilizes deep, quantitative WGS of parasite progenies and the respective parental populations, multiple crossing and mathematical modeling to identify selection valleys. This enables the accurate definition of loci under selection and the identification of multiple genes driving selectable phenotypes within a very short space of time. This modified approach allows the simultaneous detection of genes underlying multiple phenotypes, including those with a multigenic basis.
We applied this modified LGS approach to a study of SSI and growth rate in P. yoelii, a rodent malaria parasite. We were able to identify three selection valleys that contained three strong candidate genes controlling both phenotypes. Two selection valleys were implicated in SSI; the first time LGS has identified multigenic drivers of phenotypic differences in malaria parasites.
Differences in growth rates between the two clones were associated with a single selection valley containing Pyebl, which was then shown by the allelic replacement to be the gene controlling this phenotype.
SSI was shown to be controlled by genes within two selection valleys, the strongest on Chr VIII and the other on Chr VII. The Chr VIII valley contained the gene encoding MSP1, a well characterized surface antigen that has been the basis of several vaccine studies 23 , and which has previously been implicated in SSI in P. chabaudi (14, 15). A secondary putative selection valley was detected on Chr VII, which contained several genes with the signature of a potential antigen and that will require further analysis.
The two parental clones CU and 17X1.1pp differ in their growth rates. This is due to the ability of 17X1.1pp to invade both reticulocytes and normocytes, while CU is restricted to reticulocytes (21). A selection valley on Chr XIII was linked to the growth phenotype, and included Pyebl, a gene previously implicated in growth rate differences between different strains of P. yoelii, due to a mutation in Region 6 of the gene that altered its trafficking, so that the protein was located in the dense granules rather than the micronemes (23, 24). Direct sequencing of the pyebl gene of 17X1.1pp and CU revealed a SNP in region 2 of the gene, whereas region 6 displayed no polymorphism. Consistent with this, the EBL proteins of 17X1.1pp and CU were both located in the micronemes, indicating that protein trafficking was unaffected by the region 2 mutation. Allelic replacement of the parasite strains with the alternative allele resulted in a switching of the growth rate to that of the other clone, thus confirming the role of the mutation.
Region 2 of the PyEBL orthologues of P. falciparum and Plasmodium vivax (25)(26)(27) is known to interact with receptors on the red blood cell (RBC) surface. Furthermore the substitution falls within the central portion of the region, which has been previously described as being the principal site of receptor recognition in P. vivax (27). Wild-type strains of P. yoelii (such as CU) preferentially invade reticulocytes but not mature RBCs, whereas highly virulent strains are known to invade a broader repertoire of RBCs (28). It is thus likely that the C351Y mutation directly affects binding to the host receptor, which in turn may affect RBC invasion preference.
In summary, qSeq-LGS offers a powerful and rapid methodology for identifying genes or non-coding regions controlling important phenotypes in malaria parasites and, potentially, in other apicomplexan parasites. Through bypassing the need to clone and type hundreds of individual progeny, and by harnessing the power of genetics, genomics and mathematical modeling, genes can be linked to phenotypes with high precision in a matter of months, rather than years. Here we have demonstrated the ability of qSeq-LGS to identify multiple genetic drivers underlying two independent phenotypic differences between a pair of malaria parasites; growth rate and SSI. This methodology has the potential power to identify the genetic components controlling a broad range of selectable phenotypes, and can be applied to studies of drug resistance, transmissibility, virulence, host preference, etc., in a range of apicomplexan parasites.

Parasites, mice and mosquitoes.
P. yoelii CU (slow growth rate) and 17X1.1pp (intermediate growth rate) strains were maintained in CBA mice (SLC Inc., Shizuoka, Japan) housed at 23°C and fed on maintenance diet with 0.05% para-aminobenzoic acid (PABA)-supplemented water to assist with parasite growth. Anopheles stephensi mosquitoes were housed in a temperature and humidity controlled insectary at 24°C and 70% humidity, adult flies being maintained on 10% glucose solution supplemented with 0.05% PABA.

Testing parasite strains for growth rate and SSI.
Parasite strains were typed for growth rate in groups of mice following the intravenous inoculation of 1x10 6 iRBCs of either CU, 17X1.1pp or transfected clones per mouse and measuring parasitaemia over 8-9 days. In order to verify the existence of SSI between the CU and 17X1.1pp strains, groups of five mice were inoculated intravenously with 1x10 6 iRBCs of either CU or 17X1.1pp parasite strains. After four days, mice were treated with mefloquine (20mg/kg/per day, orally) for four days to remove infections. Three weeks post immunization, mice were then challenged intravenously with 1x10 6 iRBCs of a mixed infection of 17X1.1pp and CU parasites. A group of five naive control mice was simultaneously infected with the same material. After four days of growth 10 μ l of blood were sampled from each mouse and DNA extracted. Strain proportions were then measured by Real Time Quantitative PCR (RTQ-PCR) using primers designed to amplify the MSP-1 gene (29). All measurements were plotted and standard errors calculated using the Graphpad Prism software (v6.01) (http://www.graphpad.com/scientificsoftware/prism/). Wilcoxon rank sum tests with continuity correction to measure the SSI effect were done in R (30). Linear mixed model analysis and likelihood ratio tests to test parasite strain differences in growth rate were performed on log-transformed parasitaemia by choosing parasitaemia and strain as fixed factors and mouse nested in strain as a random factor, as described previously (21). Pair-wise comparisons of samples for the transfection experiments were performed using multiple 2-way ANOVA tests and corrected with a Tukey's post-test in Graphpad Preparations of genetic cross. P. yoelii CU and 17X1.1pp parasite clones were initially grown separately in donor mice. These parasite clones were then harvested from the donors, accurately mixed to produce an inoculum in a proportion of 1:1 and inoculated intravenously at 1ൈ10 6 infected red blood cells (iRBCs) per mouse into a group of CBA mice. Three days after inoculation, the presence of gametocytes of both sexes was confirmed microscopically and mice were anesthetized and placed on a mosquito cage containing ~400 female A. stephensi mosquitoes six to eight days post emergence.
Mosquitoes were then allowed to feed on the mice without interruption. Seven days after the blood meal, 10 female mosquitoes from this cage were dissected to examine for the presence of oocysts in mosquito midguts. Seventeen days after the initial blood meal, the mosquitoes were dissected, and the salivary glands (containing sporozoites) were removed. The glands were placed in 0.2-0.4 mL volumes of 1:1 foetal bovine serum/Ringer's solution (2.7 mM potassium chloride, 1.8 mM calcium chloride, 154 mM sodium chloride) and gently disrupted to release sporozoites. The suspensions were injected intravenously into groups of CBA mice in 0.1 mL aliquots to obtain blood stage P. yoelii CUൈ17X1.1pp cross progeny.

Selection of uncloned cross progeny for linkage group selection analysis.
The recombinant progeny from the above cross was subjected to growth rate selection pressure by passage in naïve mice for four days. The resulting population after growth rate selection was referred to as the "growth rate selected cross progeny". For immune selection, mice immunized with blood stage parasites of either P. yoelii CU or 17X1.1pp through exposure and drug cure (as above) were inoculated intravenously with 1×10 6 parasitized-RBC (pRBC) of the uncloned cross progeny, as described above. The blood stage parasites from the cross progeny grown in these mice which had been previously immunized with CU or 17X1.1pp, were designated "CU-immune selected cross progeny" or "17X1.1pp-immune selected cross progeny", respectively. The resulting infections were followed by microscopic examination of thin blood smears stained with Giemsa's solution.

DNA and RNA isolation.
Parental strains and growth rate-or immune-selected recombinant parasites were grown in naïve mice. Parasite-infected blood was passed through a single CF11 cellulose column to deplete host leukocytes, and the genomic DNA (gDNA) was isolated from the saponin-lysed parasite pellet using DNAzol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions. For RNA isolation a schizont-enriched fraction was collected on a 50% Nycodenz solution (Sigma Aldrich) and total RNA was then isolated using TRIzol (Invitrogen).

PCR amplification and sequencing of ebl gene.
The open reading frame of P. yoelii ebl was PCR-amplified from gDNA using KOD Plus Neo DNA polymerase (Toyobo, Japan) with specific primers designed based on the ebl sequence in PlasmoDB (PY17X_1337400). Pyebl sequences of CU and 17X1.1pp strains were determined by direct sequencing using an ABI PRISM 310 genetic analyzer (Applied Biosystems) from PCRamplified products. Sequences were aligned using online sequence alignment software Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) provided by EMBL-EBI.
Whole genome re-sequencing and mapping.
The paired end Illumina data was first quality trimmed using the software Trimmomatic (31).
Illumina sequencing adaptors were first removed from the sequences. Following that, trailing bases from both the 5' and 3' ends with less than Q20 were trimmed. Lastly, reads found with an average base quality of less than Q20 within a window size of four bases were discarded. Only reads found with surviving pair after the trimming were used for mapping with BWA (32) version 0.6.1 using standard options onto the publicly available genome of P. yoelii 17X strain (May 2013 release; ftp://ftp.sanger.ac.uk/pub/pathogens/Plasmodium/yoelii17X/version_2/May_2013/). The SAM alignment files were converted to BAM using the tool Samtools (33). Duplicated reads were marked and removed using the software tool Picard (http://picard.sourceforge.net).

SNP calling.
Genome-wide SNPs were called using the BAM file obtained for the parental CU strain using a previously prepared custom made Python script (18). The script functions as a wrapper for SAMtools mpileup and parses SNPs calls based on mapping quality and Phred base quality scores. In this experiment the values were set at 30 for mapping quality and 20 for base quality.
Also, since the P. yoelii genome is haploid and the parental strains are clonal, only SNPs where the proportion of any minor alleles was less than 30% was retained, to exclude possible sequencing errors or genuine but uninformative SNPs. The script produces a tab-delimited, human readable table that shows the total number of reads for each of the four possible nucleotides at each SNP. SNPS were called on both parental strains. CU SNPs were then filtered against the 17X1.1pp SNPs to remove any shared SNP calls. The remaining CU SNPs were then used as reference positions to measure the number of reads for each nucleotide in the genetic crosses produced in this study through another Python script (18). This script produced a final table consisting of read counts for each nucleotide of the original CU SNPs in every sample.

Selection.
SNP frequencies were processed to filter potential misalignment events. We note that, during the cross, a set of individual recombinant genomes are generated. Considering the individual genome g, we define the function a g (i) as being equal to 1 if the genome has the CU allele at locus i, and equal to 0 if the genome has the 17X1.1pp allele at this locus.
In any subsequent population, the allele frequency q(i) at locus i can therefore be expressed as for some set of values n g , where n g is the number of copies of genome g in the population.
To filter the allele frequencies, we note that each function a g (i) changes only at recombination points in the genome g. As such, the frequency q(i) should change relatively slowly with respect to i. Using an adapted version of code for clonal inference (35), we therefore modelled the reported frequencies q(i) as being (beta-binomially distributed) emissions from an underlying diffusion process (denoted by x(i)) along each chromosome, plus uniformly distributed errors, using a hidden Markov model to infer the variance of the diffusion process, the emission parameters, and an error rate. A likelihood ratio test was then applied to identify reported frequencies that were inconsistent with having been emitted from the inferred frequency x(i) at locus i relative to having been emitted from a distribution fitted using the Mathematica package via Gaussian kernel estimation to the inferred global frequency distribution {x(i)} , measured across all loci; this test filters out reported frequencies potentially arising from elsewhere in the genome.
Next, the above logic was extended to filter out clonal growth. In the event that a specific genome g is highly beneficial, this genome may grow rapidly in the population, such that n g is large. Under such circumstances the allele frequency q(i) gains a step-like quality, mirroring the pattern of a g (i).
Such steps may potentially mimic selection valleys, confounding any analysis.
In order to detect sudden frequency changes arising from clonal growth, a 'jump-diffusion' variant of the above hidden Markov model was applied, in which the allele frequency can change either through a diffusion process or via sudden jumps in allele frequency, modelled as random emissions from a uniform distribution on the interval [0,1]. For each interval (i,i+1) the probability that a jump in allele frequency had occurred was estimated. Where potential jumps were identified, the allele frequency data was split, such that analyses of the allele frequencies did not span sets of alleles with such jumps. The resulting segments of genome were then analyzed under the assumption that they were free of allele frequency change due to clonal behavior.
Inference of the presence of selected alleles was performed using a series of methods. In the absence of selection in a chromosome, the allele frequency is likely to remain relatively constant across each chromosome. Firstly, therefore, a 'non-neutrality' likelihood ratio test was applied to each contiguous section of genome, calculating the likelihood difference between a model of constant frequency x(i) and the variable frequency function x(i) inferred using the jump-diffusion model. Secondly, an inference was made of the position of the allele potentially under selection in each region. Under the assumptions that selection acts for an allele at locus i, and that the rate of recombination is constant within a region of the genome, previous work on the evolution of cross populations (20,35) can be extended to show that the allele frequencies within that region of the genome at the time of sequencing are given by for each locus j not equal to i, where X is the CU allele frequency at the time of the cross, ? is the local recombination rate, ρ ij is the distance between the loci i and j, x is an allele frequency, and e describes the effect of selection acting upon alleles in other regions of the genome. A likelihoodbased inference was used to identify the locus at which selection was most likely to act. In regions for which the 'non-neutrality' test produced a positive result for data from both replica crosses, and for which both the inferred locus under selection, and the direction of selection acting at that locus were consistent between replicas, an inference of selection was made.
For regions in which an inference of selection was made, an extended version of the above model was applied, in which the assumption of locally constant recombination rate was relaxed.
Successive models, including an increasing number of step-wise changes in the recombination rate, were applied, using the Bayesian Information Criterion (36) for model selection. A model of selection at two loci within a region of the genome was also examined.
Given an inference of selection, a likelihood-based model was used to derive confidence intervals for the position of the locus under selection. A combined confidence interval utilized data from both replicates, representing the extent of uncertainty inherent in the model, on the basis of the modelling assumptions. A second, more conservative, interval was calculated as the union of the two confidence intervals calculated when the model was applied separately to data from each replica; this second approach aims to account for 'biological noise', encompassing any deviation in the experimental system from the model assumptions. Full details of the mathematical methods are contained within Supporting Information (Supplementary Protocol 1).

Gene information and dN/dS ratios for P. falciparum orthologues.
For each combined conservative interval of relevant selective sweeps, genes were listed based on the 6.2 PlasmoDB annotation and verified against the current annotation (v26). For each gene, transmembrane domains and signal peptides predicted in PlasmoDB were also indicated. P. falciparum orthologues were also indicated, together with their dN/dS ratios stored in the current version of PlasmoDB.

Allele replacement.
P. yoelii schizont-enriched fraction was collected by differential centrifugation on 50% Nycodenz in incomplete RPMI1640 medium, and 20 µg of ApaI-and StuI-double digested linearized transfection constructs were electroporated to ~1 x 10 7 of enriched schizonts using a Nucleofector device (Amaxa) with human T-cell solution under program U-33 (39). Transfected parasites were intravenously injected into 7-week-old ICR female mice, which were treated by administering pyrimethamine in the drinking water (conc) 24 hours later for a period of 4-7 days. Before inoculation of P. yoelii 17X1.1pp parasite, mice were treated with phenylhydrazine to increase the reticulocyte population in the blood. Drug resistant parasites were cloned by limiting dilution.

Phenotype analysis.
To assess the course of infection of wild type and transgenic parasite lines, 1 x 10 6 pRBCs were injected intravenously into 8-week old female CBA mice. Thin blood smears were made daily, stained with Giemsa's solution, and parasitaemias were examined microscopically.

RNA-Seq.
Whole blood from mice infected with P. yoelii on day 5 post-infection were host WBC depleted and saponin lysed to obtain the parasite pellet. Total RNA was extracted using TRIzol reagent. Strandspecific RNA sequencing was performed from total RNA using TruSeq Stranded mRNA Sample Prep Kit LT according to manufacturer's instructions. Libraries were sequenced on an Illumina HiSeq 2000 with paired-end 100bp read chemistry (ENA: PRJEB15102). RNA-seq reads were mapped onto Plasmodium yoeli 17X version 2 from GeneDB (http://www.genedb.org) using TopHat 2.0.13 (40) and visualized using Artemis genome visualization tool (41).