Elucidating relationships between P.falciparum prevalence and measures of genetic diversity with a combined genetic-epidemiological model of malaria

There is an abundance of malaria genetic data being collected from the field, yet using these data to understand the drivers of regional epidemiology remains a challenge. A key issue is the lack of models that relate parasite genetic diversity to epidemiological parameters. Classical models in population genetics characterize changes in genetic diversity in relation to demographic parameters, but fail to account for the unique features of the malaria life cycle. In contrast, epidemiological models, such as the Ross-Macdonald model, capture malaria transmission dynamics but do not consider genetics. Here, we have developed an integrated model encompassing both parasite evolution and regional epidemiology. We achieve this by combining the Ross-Macdonald model with an intra-host continuous-time Moran model, thus explicitly representing the evolution of individual parasite genomes in a traditional epidemiological framework. Implemented as a stochastic simulation, we use the model to explore relationships between measures of parasite genetic diversity and parasite prevalence, a widely-used metric of transmission intensity. First, we explore how varying parasite prevalence influences genetic diversity at equilibrium. We find that multiple genetic diversity statistics are correlated with prevalence, but the strength of the relationships depends on whether variation in prevalence is driven by host- or vector-related factors. Next, we assess the responsiveness of a variety of statistics to malaria control interventions, finding that those related to mixed infections respond quickly (∼months) whereas other statistics, such as nucleotide diversity, may take decades to respond. These findings provide insights into the opportunities and challenges associated with using genetic data to monitor malaria epidemiology.


Implementation
In forward-dream, time is continuous and the occurrence of events is dictated by a competing Poisson process. The rate at which any event occurs, r total , is equal to: which is the sum of the total biting rate across all vectors, the clearance rate across all infected hosts and the clearance rate across all infected vectors. The time until the next event is drawn from an exponential distribution with an expectation of 1/r total . Conditional on an event occurring, the type of event is sampled proportional to its fractional rate. For example, the probability of a given event being a biting event is: If a biting event occurs, the host and vector involved are sampled at random from the population, and the nature of parasite transmission depends on their infection status. Similarly, for clearance events, the host or vector to be cleared of infection (i.e. returned to the susceptible state) is chosen at random.

Parameterisation
forward-dream is completely specified by 17 parameter values (see Table 1 of the main text). Below we discuss how these parameters were set for the simulations presented in this study.

Biting rate (b)
The biting rate (b) gives the daily rate at which a single vector bites a host, such that 1/b gives the average time between bites. It is typically stated that 1/b ranges from 2-4 days, and common in modelling papers is to give an assumed value of 1/b = 3. This is supported by some laboratory studies; for example [1] observed an approximately 3 day feeding and oviposition behaviour of Anopheles gambiae when offered daily blood meals and kept alone.
In forward-dream we set the default biting rate b = 0.25 which implies an average of 4 days between bites. We elected for a slightly lower default biting rate to partially compensate for the absence of an incubation time within the vector; bringing the average number of infectious bites delivered by a vector closer to what it would be were an incubation time included.
Vector-to-host (π h ) and host-to-vector (π v ) transmission efficiencies The vector-to-host transmission efficiency (π h ) is the probability that a host becomes infected after being bitten by an infectious vector. An assembly and reanalysis of multiple sources of information on π h has been conducted by [2]. The most direct source of information on transmission efficiency they identify comes from the control arm of drug or vaccine trials, where immunologically naive individuals are challenged with mosquitoes carrying P. falciparum [2]. Using data of this kind, [2] estimate a vector-to-host transmission efficiency of 55%. A second source of information comes from simultaneously collecting data on the force of infection (F OI, the rate at which hosts become infected per unit time) and the entomological inoculation rate (EIR, the average number of infectious bites received per host per unit time). The ratio F OI EIR is then taken to equal the vector-to-host transmission efficiency [2]. A study of this type conducted in Saradidi, Kenya, estimated the vector-to-host transmission efficiency in children to be 7.5% [3]. Reanalysis by [2], fitting three different mathematical models (unrooted linear, linear, and a model accommodating for heterogeneous biting) to observations of F OI EIR collected through time, produced estimates of 5%, 2% and 27% (for the three models, respectively). A key conclusion of [2] was that transmission efficiency declines at higher transmission intensities (in particular when EIR > 10), and that this trend can be fit well using a heterogeneous biting model with π h = 0.55. However, when models without heterogeneous biting are used, π h = 0.55 produces F OI values that are significantly above observations; lower values of transmission efficiency provide a better fit to the data.
As forward-dream does not model heterogeneous biting, we set the default vector-to-host transmission efficiency to 10% (π h = 0.1), roughly in the middle of the raw estimates reported by [2].
The host-to-vector transmission efficiency (π v ) is the probability that a vector becomes infected after biting an infectious host. Analysing 23-years of longitudinal data collected on the prevalence of infection in both hosts and vectors in Dielmo, Senegal, [4] produced estimates of π v . Similarly to the vector-to-host transmission efficiency, [4] found that the π v was lower in periods of higher transmission: ranging from ∼ 0.18 at a parasite prevalence (P R) of 0.2, to ∼ 0.03 at P R = 0.7 (for Anopheles gambiae s.l.).
In forward-dream we set the default host-to-vector transmission efficiency to 10% (π h = 0.1).

Host infection clearance rate (γ)
The host infection clearance rate, γ, is the daily rate at which hosts in the infected state (H 1 ) return to the susceptible state (H 0 ). The host infection clearance rate is modelled as a constant, and so the duration of host infection is exponentially distributed with a mean of 1/γ. Recall that in forward-dream, both singlyand multiply infected hosts are in the H 1 state. As γ is the same for all hosts in the H 1 state, multiple infection has no effect on the duration of host infection in forward-dream; all hosts have the same daily rate of clearance γ, regardless of the composition of their infection. Studies examining the duration of untreated P. falciparum infections are scarce on account of the medical obligation to treat [5]. In the mid-1900s in the USA, the recommended treatment for individuals with neurosyphilis was deliberate infection with P. falciparum malaria, and analysis of the progression of these infections has become an important source of information about untreated infection duration. Seeking to understand both the average duration of untreated infection and the best fitting distributional form, [5] analysed data from 54 such malariatherapy patients who were inoculated with P. falciparum either by sporozoites or infected blood, and were subsequently followed by daily microscopic examination. They observed a mean infection duration of 211.6 days [5]. It is also notable that they found that both the Gompetz and Weibull distributions provided better fits than the exponential model [5]. This data source has significant caveats, however, including that the adults involved were all naïve and co-infected with syphilisis, and the strains used were selected for having a low clinical virulence and the greatest curative properties [9]. Ultimately, the extent to which these infections reflect those developed in natural settings is unclear.
A second source of data on untreated infections comes from a study conducted in the Kassena-Nankana District of Northern Ghana in 2000-2001, a region with very high transmission (EIR > 300) [10]. In this study, approximately 300 asymptomatic individuals positive for malaria were tracked for a ten month period, where every two months blood samples were taken and both microscopic and PCR analysis of the msp2 gene was performed [10]. The analysis of the msp2 gene allowed for the tracking of individual parasite genotypes in a context where most individuals were multiply infected. A variety of authors have analysed this data source using different methods, and estimates of the duration of infection range from as low as 55 to as high as 319 days (see Table 1).
The review by [11] highlights other sources of information on infection duration, including accidental cases of malaria infection by blood transfusion, and provides evidence that in some cases a P. falciparum can last for multiple years.
For forward-dream, we set the default host infection clearance rate γ = 0.005, corresponding to an average duration of infection of 200 days. This is Macdonald's original estimate of infection duration [12], and broadly consistent with the estimates described above.

Vector infection clearance rate ( )
In forward-dream, the vector infection clearance rate, , is the daily rate at which vectors in the infected state (v 1 ) return to the susceptible state (v 0 ). In the Ross-Macdonald model, it is classically assumed that vectors do not clear infections, but rather that the flow from v 1 to v 0 is driven by the death of infected vectors and their (instantaneous) replacement by susceptible vectors (reviewed in [13]). This allows for the vector population size (N v ) to be kept fixed. It further implies that represents the daily probability of death for infected vectors. As is a constant, the duration of vector lifespans is exponentially distributed with a mean of 1/ . Consequently, vectors do not have an increased probability of death with age (they do not experience senescence); the common justification for this is that mosquitoes die too rapidly of other causes to die as a consequence of aging.
To set , we looked for data on either lifespan or the daily survival rate (commonly denoted p) of Anopheles mosquitoes from the field. Since we are looking to parameterise infected mosquitoes, we are particularly interested in data on female Anopheles mosquitoes. We note that there are a large variety of factors that influence mosquito survival rate in nature -including the climate/weather, most pertinently the temperature and humidity, the species and activities of the mosquito, and the presence of parasites and predators -and that these factors likely vary substantially through time and across different geographies [14]. Thus, in malaria endemic areas there is likely variation in through space in time.
A relatively straightforward approach to estimating the lifespan of Anopheles mosquitoes is by breeding them in controlled laboratory environments or outdoor cages. However, estimates of this kind are generally thought to be inflated, as many common sources of mortality that would act on wild vectors are not in effect [15]. [1] found that the maximum lifespan of a laboratory colony A. gambiae mosquitoes, fed on a 10% sucrose solution, was 16 days at 27 • C and 34 days at 22 • C (p = 0.94 and p = 0.97, respectively). Given that mortality in laboratory experiments is driven primarily by senescence, whereas in the Ross-Macdonald model it is implicitly assumed that mortality due to senescence is negligible, we consider these estimates unsuitable for parameterising forward-dream.
In general, the lifespan or survivorship of wild vectors cannot be measured directly [14]. However, a multitude of methods have been developed to provide indirect estimates. Below we briefly outline two general types of estimation approaches, and in Table 2 we provide a selection of estimates arising from them.
For a more detailed discussion and additional estimates, the reader should consult Chapter 13 of [16] which covers the subject extensively.  The first type of approach is based on Mark-release-recapture experiments. Here, mosquitoes are marked, originally with paint or a radioisotope, and released in a natural environment [22]. It has been observed that the number of marked mosquitoes recaptured on every subsequent day declines, and it assumed this is due to mortality [22]. From this decline, daily mortality can be estimated in a variety of ways (reviewed in [16]). This type of approach can be challenged by low recapture numbers, changes in mosquito survival or behaviour on account of the marking, and the confounding of mortality with other reasons for mosquito loss, such as dispersal.
A second approach looks for indicators of the age-composition of the mosquito population, typically with respect to developmental stages. For example, whether a mosquito is parous (has laid eggs) or nulliparous (has not laid eggs) can be estimated by dissection [23]. If an assumption is made about the duration of time until egg laying, the daily survival rate can be estimated by the proportion parous [23]. A great diversity of methods of this type exists, focusing different age-composition indicators (reviewed in [16]). In general, they all involve making assumptions about (and are sensitive to) the duration of the developmental stage(s) under consideration.
Overall, estimates of the survival rate (p) of Anopheles mosquitoes typically fall between 0.75 and 0.95 (see Table 2). For forward-dream, the vector infection clearance rate is = 1 − p and we set the default value at 0.2, corresponding to a survival rate of 0.8.

Number of sub-compartments in hosts (n h ) and vectors (n v )
An individual host can harbour up to 10 9 -10 11 P. falciparum parasites [24]. However, in all the simulations described here, n h (and n v ) are set to twenty. This is motivated primarily by reducing computational costs, but the simplification can also be justified biologically. First, even if an individual host contains 10 9 -10 11 parasites, P. falciparum replicates clonally within the host and so the majority of these will have near-identical genomes. A host may, however, be infected by multiple distinct parasite strains, generating a mixed infection. Importantly, setting n h = 20 still accommodates for this possibility. Second, parasite genomes that exist at a frequency much less than 1/20 of the population are unlikely to produce substantial signal in genetic data. For example, with whole-genome sequencing data, sequencing depths are typically ≤ 100X and so parasites comprising less than 1/20 of the infection will be represented by only a handful of reads.

Mutation and drift rates in the host (θ h , d h ) and vector (θ v , d v )
We have combined the mutation and drift rate parameterisation into a single section because in forward-dream mutational events occur conditional on drift events, and thus the parameters are related. In particular, considering hosts, drift events occur at a fixed constant rate d h such that for any time interval δt the number of drift events will be drawn from ∼ P oi(δtd h ). Note that forward-dream implements a Moran model, such that for each drift event two sub-compartments are selected at random, and one is duplicated while the other is deleted. With probability θ h N snps , each of these events will be instead a mutational event. The same occurs for vectors with rates d v and θ v .
The malaria parasite genome in forward-dream is designed to represent a collection of SNPs, and there are at least two sources of information on the single-nucleotide variant (SNV) mutation rate of P. falciparum. The first is from clone-tree experiments, where P. falciparum strains are cultured in vitro and accumulated mutations are detected by whole-genome sequencing. From these experiments several estimates of the mutation rate of P. falciparum during the erythrocytic phase have been produced, typically on the order of ∼ 10 −10 [25; 26]. A second approach is to look at sequence divergence from a related Plasmodium species, such as Plasmodium riechenowi [27]. Here, the estimated mutation rate corresponds to the entire generation of P. falciparum, i.e. including both the host and vector phases. We have tabulated a selection of these mutation rates below. We are unaware of any mutation rate estimates specific to the vector phase.  Table 3: A selection of mutation rate estimates for P. falciparum. The two different estimates from [27] arise from differing assumptions about the speciation time between P. falciparum and P. reichenowi (5Mya or 7Mya.). Note that [28] estimates are an order of magnitude higher, but have been adjusted to include unobserved lethal mutations.

SNV
It was unclear to us how to use these mutation rates to parameterise forward-dream. Given that n h = 20 and n v = 20, each sub-compartment can be thought of as representing the consensus sequence of 1/20th of the parasite population, rather than a single genome. This implies that a mutational event in forward-dream is equivalent to an event whereby a novel variant reaches a frequency 1/20, within the host/vector. Note that this is not co-incident with the mutation occurring, but is related also to the rate of drift. In addition, though n h and n v are fixed in forward-dream, in reality the size of the parasite population fluctuates dramatically in both the vectors and the hosts over the course of an infection -going through exponential growth phases and bottlenecks in response to immune pressure.
For simplicity, we set the host and vector mutation and drift rates equally, to d h = 2, d v = 2 and θ h = 0.0000125, θ v = 0.0000125. Thus, for both hosts and vectors, a drift event occurs on average twice every day and, with N snps = 8000, one in every ten drift events produces a mutation. Since there are ten subcompartments in both the hosts and the vectors, this implies that an individual sub-compartment acquires a mutation once every one-hundred days. In the future, estimates of these parameters can be strengthened by the collection of longitudinal WGS data for P. falciparum.
Probability a given parasite genome passes through the host-to-vector (p h ) or vector-to-host bottleneck (p v ) The two parameters p h and p v control the size of the bottleneck when parasites are being transmitted from the host to the vector, or vector to the host, respectively. In terms of the forward-dream model, p v gives the probability that a given sub-compartment in the vector is selected to colonize a host sub-compartment (see main text). Biologically, we expect it to be determined by the processes that reduce the amount of genetic diversity found in the initial blood-stage infection, relative to that present in the salivary glands of the vector that initiated that infection. This would include the number of sporozoites that are transmitted to the host, how representative they are of the genetic diversity within the salivary glands, and what fraction of them succeed in invading hepatocytes.
By dissecting wild caught Anopheles gambiae salivary glands and quantifying the abundance of sporozoitespecific antigens (either with an ELISA or immunoradiometric assay), the average number of sporozoites within an infectious vector has been found to be around 4000 to 5000, although the distributions are highly variable with some mosquitoes found carrying tens of thousands of sporozoites [29; 30]. Estimates of the number of sporozoites delivered in a single bite are also variable. Using a membrane feeding assay, [31] found that a geometric mean of 4.9 sporozoites were delivered by An. freeborni and 11.3 sporozoites were delivered by An. gambiae mosquitoes. In rodent model systems, two studies found an average of 123 and 281 Plasmodium berghii sporozoites were injected in a single bite, though again the distributions were highly variable with over a thousand sporozoites delivered in some cases [32; 33]. Cumulatively these results suggest that, typically, only a small percentage of salivary gland sporozoites are delivered to the host [31]. Finally, analysing early blood stage infections, [34] deduced that roughly 20% of sporozoites injected into the host reach the liver stage and ultimately produce merozoites; further emphasising that only a very small percentage of all salivary gland sporozoites establish infection within the host.
However, we note that the numerical reduction in number of sporozoites may be significantly greater than the amount of genetic diversity that is lost, as this is critically dependent on the amount of genetic diversity in the original salivary gland population. Consider that if there are only four unique P. falciparum genomes within the salivary glands (which could occur if the vector harboured a single oocyst), then a minimum of only 4 sporozoites is needed to re-establish 100% of genetic diversity within the host. This would be possible even if only tens to hundreds of sporozoites were transmitted. Furthermore, we note that the rate of co-infection in P. falciparum has been found to be high: globally it is nearly equivalent to the rate of super-infection in mixed infections carrying two strains [35], and there is evidence that co-infection occurs regularly in high transmission settings [36], observations that would be unlikely if a substantial fraction of diversity was lost with transmission.
As a starting point, we set p v to 0.2 and p h to 0.2; though the biological processes impacting the value of p h differ (it concerns the fraction of gametocytes that are imbibed by a vector), it is subject to similar uncertainties as p v . As a consequence (and with n h and n v set to 20), a mean of four parasite genomes will pass through the bottleneck; though we note that in ∼ 40% of cases it will be less than four. Studies analysing the genetic diversity present within the vector midgut and salivary glands could strengthen the estimates of these parameters.

Distribution of number of oocysts (p oocysts )
As described in the main text, we model the number of oocysts that are generated during meiosis in an individual mosquito by a truncated geometric distribution: N oocysts ∼ max[Geo(p oocysts ), 10] p oocysts therefore represents the probability of having only a single oocyst, and we restrict the maximum number of oocysts to ten. The number of oocysts in wild-caught Anopheles mosquitoes can be determined by dissection. We used two references to set p oocysts : (i) [29] caught Anopheles gambiae s.l. near Kisumu in Kenya, and Fig. 2 of that reference suggests an approximately geometric distribution with p = 0.5; (ii) [30] caught Anopheles gambiae s.l. near Basse in The Gambiae, and Fig. 4 of that reference appears roughly geometric, with a p ≤ 0.5. Consequently, we set p oocysts = 0.5 in forward-dream.