Estimating the basic reproduction number of a pathogen in a single host when only a single founder successfully infects

If viruses or other pathogens infect a single host, the outcome of infection may depend on the initial basic reproduction number R0, the expected number of host cells infected by a single infected cell. This article shows that sometimes, phylogenetic models can estimate the initial R0, using only sequences sampled from the pathogenic population during its exponential growth or shortly thereafter. When evaluated by simulations mimicking the bursting viral reproduction of HIV and simultaneous sampling of HIV gp120 sequences during early viremia, the estimated R0 displayed useful accuracies in achievable experimental designs. Estimates of R0 have several potential applications to investigators interested in the progress of infection in single hosts, including: (1) timing a pathogen’s movement through different microenvironments; (2) timing the change points in a pathogen’s mode of spread (e.g., timing the change from cell-free spread to cell-to-cell spread, or vice versa, in an HIV infection); (3) quantifying the impact different initial microenvironments have on pathogens (e.g., in mucosal challenge with HIV, quantifying the impact that the presence or absence of mucosal infection has on R0); (4) quantifying subtle changes in infectability in therapeutic trials (either human or animal), even when therapies do not produce total sterilizing immunity; and (5) providing a variable predictive of the clinical efficacy of prophylactic therapies.


Introduction
When viruses or other pathogens infect a single host, the basic reproduction number R 0 is the expected number of cells infected by a single infected cell [1]. The initial R 0 is a fundamental determinant of whether an infecting viral population will establish itself in the host. On one hand, if R 0 < 1, the viral invaders reproduce below replacement and will go extinct. On the other hand, if R 0 is slightly greater than 1, an initial virus has a small positive probability of amplifying into a systemic infection, and if R 0 is large, infection is all but inevitable.
The initial basic reproduction number R 0 is therefore a continuous variable with direct biological pertinence to infection. As such, it may have many underappreciated applications. As a general example, consider therapeutic trials. More specifically, consider as a motivating a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 application pre-clinical tests of HIV therapies like vaccines in animal models. In this context, the macaque model is popular, because simian immunodeficiency virus can cause a disease progression resembling AIDS in humans [2]. Historically, macaque trials often used a single high-dose intravenous or mucosal inoculation to ensure almost certain infection of unprotected animals [3]. The consequent assessment of therapeutic efficacy depends primarily on binary categorical data, i.e., whether or not infection occurred in treated animals [4,5]. New experimental designs, notably repeated low-dose challenges, have improved statistical power in animal trials [3,6], but do not remove the intrinsic statistical limitations of binary data. The ability to estimate a continuous variable like the initial R 0 in HIV infection would in principle permit a statistically more powerful analysis of therapeutic efficacies.
The initial R 0 is likely a primary determinant of whether systemic infection occurs, and initial microenvironments with relatively few target cells probably impede systemic infection by HIV. Consider, e.g., that the number of per-act transmissions per 10,000 exposures varies considerably by route of infection [7]. For sexual exposure, the number ranges from less than 4 to 138; for needle-sharing, it is about 63. In contrast, for vertical transmission between mother and child, the number is 2260; for blood transfusion, it is 9250. Thus, the initial microenvironment may starkly limit the reproduction of HIV until the virus escapes into systemic circulation, where target cells are plentiful.
Unfortunately, practical thresholds for HIV detection make the initial R 0 inaccessible to direct measurement, because on average, viremia is delayed until about 10 days after exposure to HIV [8,9]. Modern techniques for measuring the abundance of HIV [10] have yielded estimates, e.g., R 0 � 6 [11] or R 0 � 8 [12]. These estimates of R 0 pertain to viremia, however, when there are at least 20 viruses/ml (the current lower limit of detectability) and thus about 10 5 viruses in the total blood volume of 5L [13]. By the time HIV is detectable in blood samples and R 0 can be measured directly, infection has long since established itself.
The viral dosage may vary considerably between the modes of transmission above, and in sexual or mother-to-child transmission of HIV, e.g., the genetic diversity at early stages of infection usually reflects the number of viruses founding the infection [14][15][16]. In fact, however, about 80% of all HIV infections arise from a single founding viral sequence [17][18][19]. Moreover, the design of repeated low-dose challenge animal trials is likely to cause infections with a single founding virus. Thus, because of its practical importance, the case of a single founder virus is a convenient starting point for mathematical analysis, and the rest of this article assumes a single founder.
Although most of this article is self-contained, it continues a scientific program started in [20]. Direct measurement of R 0 early in infection may be impractical, but Fig 1 in [20] showed that the initial R 0 displays its footprint in HIV sequences sampled in early viremia, during exponential expansion of the viral population. To describe the essence of Fig 1 in [20], if only two daughters of the founder successfully contribute descendants to the viremia, and one daughter has a novel mutation away from the founder, about half the sequences sampled in viremia have the mutation. In contrast, if the founder has many daughters successfully contributing descendants to the viremia, far fewer than half the sequences sampled in viremia are likely to have any given mutation.
The structure of this article is as follows: the theory section applies standard statistical procedures to yield a robust method for estimating R 0 from sequence data. The Methods section then describes the simulation of a continuous-time branching process whose parameters are pertinent to sampling sequences of the HIV gp120 gene. The process is the "Gamma model" of [20], a special Bellman-Harris process [21] that idealizes HIV reproduction. The Results section displays the accuracy of our estimator in recovering R 0 in the idealized simulation and the Discussion section examines some consequences of the theory. Finally, the Supporting Information compares our estimator to other (inferior) statistical techniques that we applied to recover R 0 from the same idealized simulation.
All approximations in this article are uncontrolled, i.e., we cannot provide bounds on the error that the approximations cause. Usually without comment, therefore, we rely on simulations of the Gamma model to assess their accuracy. Parameter regimes not pertinent to HIV reproduction and the sampling of gp120 sequences require separate assessment and are beyond the immediate practical purview of this article.
Finally, as motivation, in the context of the Gamma model, our estimator compares favorably with state-of-the-art methods. It has an exceptionally simple analytic form, e.g., yielding a negligible computation time in comparison to exact Bayesian calculations. Moreover, under methods analogous to ours, the coalescent process yields the same estimator as continuoustime branching process models of population growth. Under the Gamma model for HIV reproduction, however, the estimator has a singularity, making it useless for quantitation if R 0 � e � 2.7.

Theory
The following set-up assumes that a single founder virus has infected a single host (e.g., a single HIV infects a human). The set-up is mostly self-contained, drawing only on a few critical approximations presented elsewhere with some mathematical foundations [20]. First, before modeling viral ancestry, we examine the sampling of the viral sequences.
With "≔" denoting a definition, define (m) ≔ {1, 2, . . ., m}. Fix an alphabet Λ, e.g., the unambiguous nucleotide alphabet Λ = {a, c, g, t}. In practice, sequence analysis must invoke a strategy for handling anomalous characters in an alignment (e.g., ambiguous nucleotides or gap characters). Sometimes, anomalous characters are infrequent, so that as an acceptable approximation, the analysis can treat them as ordinary characters by enlarging its alphabet. Sometimes, the analysis simply omits columns containing them. Without further comment, the following assumes that the practical analysis has adopted an unspecified strategy for handling anomalous characters.
Consider a set S • ≔ (S m : m 2 (M)) consisting of M sequences. The sequences are sampled simultaneously from the descendants of a single founder sequence F. Align the sequences S • , so that the sequences S • form an alignment matrix S •,• ≔ {S m,n : m 2 (M), n 2 (N)} of N columns. Given an unspecified strategy for processing sequences S • into S •,• , the analysis here simply starts with the alignment matrix S •,• . Implicitly, F aligns with S •,• , so the letter F n in F is ancestral to each letter S m,n in the matrix column n 2 (N) (where S m,n is from the sequence S m ). In practice, F is often unknown, a complication we handle shortly.
The Iverson bracket for indicator random variates is a standard notation [22]: The difference D n ≔ D n (F) ≔ M − M n (F n ) counts letters in column n that have mutated away from the founder F; let D ≔ D(F) ≔ (D n (F): n 2 (N)). Given D, let Z m ≔ Z m ðFÞ ≔ P N n¼1 ½D n ðFÞ ¼ m� count the alignment columns n 2 (N) where m letters differ from the founder letter F n , and define the site frequency spectrum (SFS) as η ≔ η(F) ≔ (η m (F): m 2(M)). Typically, F is unknown, so η is not observable.
To develop a statistical model for η, let ε n be the probability of a mutation per base per generation in column n of the alignment. As in the infinite-sites model [23], we neglect the extremely rare possibility that two or more mutations occur in the ancestry of a single letter S m,n . Let m ¼ P N n¼1 ε n be the expected number of novel mutations per generation in the sequences. Despite its biological importance, the effects of preferential selection on sequence data are practically imperceptible during the first six months of HIV infection (see the first paragraph in the Materials and Methods of [19] and Fig 1 in [24]). Assume therefore that (ε n : n 2 (N)) are all small, and that novel mutations are all independent. To a good approximation, in every daughter the counts of novel mutations are independent Poisson variates with fixed mean μ. For linguistic convenience, let every virus be both her own ancestor and her own descendant. Let A m count the non-founder viral ancestors with m descendants in the sample, and define as in [20] the ancestral sample frequency spectrum (AFS), A ≔ (A m : m 2(M)). (Each sampled sequence contributes to A 1 , e.g., because by convention, each sampled sequence is its only descendant in the sample). Under an infinite-sites model, every novel mutation occurs in a different column of the alignment. Every alignment column with m mutations therefore corresponds to a novel mutation in an ancestor with m descendants in the sample [25]. Given A, the coordinates of η are independent Poisson variates, with η m having mean μA m (see, e.g., Theorem 1 in [20]). Accordingly, the relationship is written as η= d Poission(μA), where "= d " indicates equality of distributions.
Eq (17) in [20] used the law of total variance to write Now, we restrict the discourse to the Gamma model for HIV gp120 (as detailed in the Methods section, which need not be read yet). Simulations of the Gamma model showed [20] that for μ = 0.0551 (the value for HIV gp120), the typical magnitude of the ratio m 2 s 2 ðA m Þ=ðmEA m Þ from Eq (1) was at most about 18%. For μ = 0.0551 in the Gamma model, therefore, the mutational variance mEA m makes the dominant contribution to σ 2 (η m ). The form of Eq (1) shows that the dominance remains robust to varying μ (particularly decreasing it), as long as the ratio m 2 s 2 ðA m Þ=ðmEA m Þ remains small (say, less than 50%, occurring about μ � 0.0551 × (0.50/0.18) � 0.153).
Let a m ≔ EA m and a ≔ (a m : m 2 (M)). Given the distributional equality η= d Poission(μA), our observations on Eq (1) therefore suggest that the distributional approximation η� d Poission(μa) pertains, as follows. In the present context (the distribution of η under μ = 0.0551 in the Gamma model), the variation of A contributes little to the variance of η m : effectively, as noted by other authors [1,19,24], A behaves as though it were the constant a m ¼ EA m when contributing to random fluctuations in η m . In effect, the approximation A m � a m treats Eq (1) as an expansion in μ around μ = 0 and drops terms quadratic in μ to retain the approximation s 2 ðZ m Þ � mEA m . As a linear approximation, it should improve as μ decreases and worsen as μ increases.
To avoid distracting subscripts in the following equations, let r ≔ R 0 denote the basic reproduction number R 0 from the Introduction. For some practical purposes, HIV reproduces almost in lockstep, with synchronous generations (see the Delta model of [20], a Galton-Watson branching process [26]). Let G count the generations of HIV after host infection. To summarize the previous paragraph, In any ancestry with G synchronous generations, MG ¼ X M m¼1 mA m . (Proof: count the ancestors in each of the generations g = 1, 2, . . ., G, accounting for the multiplicity m of their sampled descendants. The total count is equivalent to counting each of the M samples G times). Take expectations to derive Reference [20] showed that for the Gamma model, accurately approximated a m ¼ EA m (m = 2, 3 . . ., M). The heuristic behind the approximation follows. To a good approximation, HIV has synchronous generations g = 1, 2, . . ., G. On average, generation g contains r g individuals. Each viral sequence in the sample therefore has an approximate probability r −g of descending from any particular individual ℊ in generation g. Thus, the probability that ℊ has m descendants in a sample of size M is approximately the binomial probability on the right of Eq (4). Sum the binomial probability over the individuals ℊ in generation g (on average, r g in number) and then over all generations g = 1, 2, . . ., G. Let G tend to infinity to derive Eq (4).
Eqs (3) and (4) In statistical notations, "•" often suggests a sum, as in a ðDÞ � . The variable η in Eq (2) depends on the unknown founder sequence F. To relate η to an observable, defineM n ≔ max L2L M n ðLÞ, the maximum count of any single letter in column n. Loosely,D n ≔ M ÀM n then counts minority letters in column n. Unlike D n , the observablẽ D n has no dependency on the founder sequence F. Let bxc ≔ maxfi 2 Z : i � xg denote the floor function, withM ≔ bM=2c. Define the folded SFSη ≔ ðZ m : m 2 ðMÞÞ [27], with  (6): Henceforth and without comment, the same pattern generates folded quantities (denoted by over-tildes) from unfolded quantities, e.g.,ã ≔ EÃ. The folded SFSη inherits an approximate Poisson distribution from the SFS η: Because the pattern suggests imposing the definitioñ Eq (7) applied to the observableη yields a maximum likelihood estimate (MLE) ðĜ;rÞ. The asymptotic properties of an MLE maker a reasonable benchmark for other statistical estimates of r. Recall Eqs (8) and (9) relatingã 1 and MG, and take natural logarithms in Eq (7) to derive the (approximate) log-likelihood ln LðG; rÞ ≔ ln pðηjG; r; MÞ � À mðMG Àã � Þ þZ 1 ln ðMG Àã � Þ þ where "�" indicates an equality of functions, possibly ignoring an irrelevant additive term depending only on data (e.g., a term equaling a function ofη).
As a useful approximation, the following treats G as a continuous variate. Now, for any value of r (and not just r ¼r), where the second equality holds if the maximum is an internal maximum, so the argumentĜ is determined by setting the derivative with respect to G equal to 0 (i.e., ifĜ satisfies mðMĜ Àã � Þ ¼Z 1 ). An MLEr then maximizes the profile log-likelihood, defined as has a root at r ¼r. In the following, if an MLE lacked an explicit analytic expression, a goldensection search for maxima determined it numerically. Unfortunately, the direct method of determiningr by substituting a m ðrÞ � a ðDÞ m ðrÞ and then maximizing Eq (12) or solving Eq (13) entails many undependable numerical computations, because Eq (4) for a ðDÞ m ðrÞ requires multiplying unreasonably large and small numbers, followed by adding the resulting products with great precision. For completeness, the Supporting Information develops an approximate MLEr ðDÞ by approximating a m (r) with a ðDÞ m ðrÞ and compares it with our best estimator, derived as follows. Approximate the sum in a ðDÞ m by an integral (the first term of an Euler-Maclaurin series [28] where the final equality derives from the evaluation of a beta integral. Comparison of Eq (4) and the two sides of Eq (14) shows that approximates a m for m = 2, 3, . . ., M − 2.

Methods
Although the biological rationale given below in support of the Gamma model of HIV gp120 is brief, the interested reader can find a more detailed discussion elsewhere [20]. The Gamma model was simulated for each r = R 0 , as follows. Each realization started with a single successfully infecting founder virus, which (for bookkeeping purposes) died at time t = 0, giving birth to a random number Z 1 of successfully infecting daughters. Biologically, each infected cell produces thousands of daughter virions, each with a small independent probability of infecting. The simulation therefore chose the number Z 1 from a Poisson distribution with mean r = R 0 . Each daughter lived an independent random time after her birth. To approximate life-cycle times relevant to HIV [19,29], the random times had a gamma distribution with mean 2 days and standard deviation 0.24 days. The shape and rate parameters of the gamma distribution were therefore (n, λ) = (69.4, 34.7) (to make the mean nλ −1 = 2 and variance nλ −2 � 0.24 2 ). The life-cycle of all the founder's descendants were similar, making the Gamma model a Bellman-Harris process [21] with parameters specific to HIV. Each realization generated daughters until there were 6000 live viruses, to mimic a threshold of viral detection in blood. We also examined thresholds larger than 6000, but the exact threshold did not substantially alter scientific conclusions. If the viral population went extinct first, the realization restarted with a new founder. The 6000 live viruses were then sampled to produce six samples, of sizes M = 2 k (5 � k � 10). For each sample size M, tracing back the ancestry of the sample determined the ancestral sample frequency A, which yielded the folded ancestral sample frequencyÃ.
In HIV, the gp120 gene is about 2550 nt long, and (with crossovers neglected) each HIV replication averages ε � 2.16 × 10 −5 point mutations/base/replication [1]. On average, therefore, each RNA replication entails μ � 0.0551 mutations in gp120. Simulating from the Poisson distribution in Eq (7) with μ � 0.0551 yields a folded site-frequency spectrumη for the realization. IfZ 2 ¼Z 3 ¼ . . . ¼ZM ¼ 0, the realization fails to estimate ln r. For each r and each M, the simulation recorded the number F of failed realizations encountered before performing 1000 successful realizations.
For each of the 1000 successful realizations,η yielded lnr for the estimater ðIÞ in Eq (18). The simulation also estimated the corresponding standard deviationsŝð lnrÞ given in Eqs (19). For each r and each M, the simulation calculated a sample mean Eð lnrÞ and sample standard deviation sð lnrÞ from the 1000 successfully sampled values of lnr. It also calculated the sample mean of the estimated standard deviation Eŝð lnrÞ for comparison with the sample standard deviation. We simulated ancestries as described above for a discrete grid of basic reproduction numbers r = (1 + ε) k (k = 1, 2, . . .,blog 1+ε Rc), so 1 < r � R, with R = 10 and ε = 0.1 (as in [20]).

Results
Eq (18) suggests that X MÀ 2 m¼2 Z m tends to decrease as r increases (the Introduction refers the reader to Fig 1 in [20] for an intuitive explanation). If (18), r ðIÞ ¼ 1 so one can only infer qualitatively that r is large. Thus, if every minority letter in an alignment column is a singleton, making η −1 = 0, a realization fails to estimate r quantitatively. In each subfigure of Fig 2 forr ¼r ðIÞ (and also S1 and S2 Figs in the Supporting Information), the solid curves indicate the sample mean y ¼ E log 10r ; the two dashed curves above and below each solid curve indicate y ¼ E log 10r � sð log 10r Þ, i.e., they indicate bands above and below y ¼ E log 10r of height equal to the sample standard deviation sð log 10r Þ; and the two dot-dashed curves above and below each solid curve indicate y ¼ E log 10r � Eŝðlog 10r Þ, i.e., they indicate bands above and below y ¼ E log 10r of height equal to the sample mean of the estimated standard deviation.
In Fig 2, both x-and y-axes display log 10 r: the horizontal, the true value x = log 10 r; the vertical, the estimated value y ¼ log 10r . The x-and y-axes run from log 10 r = 0.0 to log 10 r = 1.0, i.e., r = 1 to r = 10. The dotted diagonal line indicates perfect estimation,r ¼ r. In their upper left, each of the subfigures (a)-(f) indicates the corresponding sample size M = 2 k (5 � k � 10). the true x = log 10 r. In Fig 3, y ¼ log 10r ðCÞ is consistently larger than x ¼ log 10r ðIÞ . The two estimatorsr ðCÞ andr ðIÞ agree well asr ðIÞ decreases to 1, in accord with the Taylor expansion near r = 1 following Eq (20). (The Appendix of [20] also shows that in the present context, the Delta model of [20], the coalescent model, and the continuous-time branching-process model produce the same limiting SFS as r decreases to 1). Fig 3 also shows, however, that for larger values ofr ðIÞ , the coalescent estimater ðCÞ becomes a gross overestimate, and it even blows up to infinity at log 10r ðIÞ ¼ log 10 e � 0:434.

Discussion
In infection of a single host, the basic reproduction number R 0 is the expected number of cells infected by a single infected cell. This article shows how mutational variations observed in a sample from a population descended from a single founder yield an estimatorr ðIÞ of R 0 . It verifies the accuracy of its uncontrolled approximations with simulations that show thatr ðIÞ reproduces R 0 within accuracies practicable for some purposes. Our plots and statistics took log R 0 as the natural scale for the basic reproductive number R 0 . The population size at generation g is N g = (R 0 ) g , so the population effects of changes in R 0 are linear on a log scale: log N g = g log R 0 . On one hand, increasing R 0 by 1 has a greater effect on N g when R 0 = 1 than when R 0 = 10. On the other, hand, doubling R 0 has the same additive effect on log N g independent of the value of R 0 .
Before considering the biological implications, we make a few technical statistical observations on the estimate itself. In simulations of HIV gp120 sequences, the absence of mutations common to sequence samples can cause estimation of R 0 to fail. For sample sizes M � 128, the probability of failing to estimate R 0 was less than 0.051 for all 1 < R 0 < 10 (see Fig 1). To the authors' knowledge, studies sampling HIV sequences from patients typically sample between M = 16 and M = 30 [30] sequences per patient, an insufficient depth to test the present theory.
The estimatorr ðIÞ in by Eq (18) has a simple analytic form, a negligible computation, that should be compared to the complexity of competing Bayesian calculations. Moreover, it does not contain derivatives that may lose accuracy because of numerical differencing. As noted The basic reproduction number of a pathogen in a single host after Eq (18), the coalescent and continuous-time birth-and-death branching process models of population growth yield estimators analogous tor ðIÞ . In fact, the estimators are the same single estimatorr ðCÞ . The models can be manipulated to yield other estimators, so the comparison of models is by no means exhaustive, but in the present contextr ðCÞ was clearly inferior tor ðIÞ . It even displayed a singularity with our simulated HIV data (see Fig 3), suggesting that gradual reproduction in continuous time may have its limits when modeling the lytic viral bursts of HIV.
Liker ðCÞ , but to a lesser extent, the estimatorr ðIÞ overestimated R 0 throughout the full range tested, 1 < R 0 < 10 (see Fig 2). Despite the overestimation, which became more severe as R 0 increased, estimates were accurate enough to be practicable for some purposes. As Fig 2 indicates, because of monotonicity ofr ðIÞ , experimental results with a large enough dataset can demonstrate a decrease in R 0 . Moreover, near R 0 = 1, both the overestimation and the estimated error appeared relatively small, a useful property for detecting subtle therapeutic progress in reducing R 0 . In summary, the integral estimatorr ðIÞ is easy to compute throughout the full range 1 < R 0 < 10 tested; its bias is noticeable not excessive for all sample sizes M � 32; and its error estimates are generally reliable for all sample sizes M � 128 (see Fig 2 for details).
The estimator lnr ðIÞ in Eq (18) is linear in μ, suggesting that Eq (18) provides the first term of an expansion that our approximations have linearized around μ = 0. Other articles [1,19,24] introduced and justified the linearizing approximation of an unvarying phylogeny, given here after Eq (1). The theory following Eq (1) indicates that the approximation steadily loses accuracy as μ increases, but simulations show that it retains enough accuracy to remain practicable in parameters ranges pertinent to HIV gp120.
We now turn to biological considerations. The chief limitation of our study derives from its attempt to use a simple mathematical model capture the complex biology of HIV infection. In fact, R 0 is likely to change as infection reaches new microenvironments in the host. Naïve use of the estimates here can only produce a single effective R 0 for early infection. In future (but beyond the purview of this article), we plan to incorporate time-dependencies into the simulation of R 0 , and to develop estimates that can recover some of the time-dependency.
Presently, our mathematical model assumes a single founder. The extension of mathematical modeling from a single founder to multiple founders is an important relaxation of assumptions [19]. Regardless, the single-founder assumption is often satisfied in HIV infection, because most HIV infections have a single founder. For simplicity, it also assumes a constant R 0 . After initial infection, HIV traverses different host microenvironments, potentially undergoing genetic bottlenecks. On one hand, a bottleneck can bias estimates based on sequence samples, because they obscure whether a most recent common ancestor dominates early in infection (a founder) or only after the bottleneck. In the present context, therefore, bottlenecks may bias estimates away from an initial R 0 and towards R 0 for a later microenvironment. On the other hand, even multiple escape lineages do not seriously bias genetic estimates of time since infection [24], so estimates of the initial R 0 may share a similar robustness.
Estimates of R 0 may also clarify HIV biology as an infection progresses. If target cells are scarce in a microenvironment, HIV may proliferate predominantly by cell-free spread, budding from an infected target cell, entering the extracellular fluid, and infecting another target cell by chance encounter [31]. Conversely, if target cell are plentiful, e.g., in the microenvironment of a lymph node [32,33], direct cell-to-cell spread may be more efficient than cell-free spread. Cell-to-cell spread has two distinct mechanisms (and therefore can occur in qualitatively distinct environments): (1) transmission of HIV by virological synapses between adjacent target cells or (2) transmission by capture and transfer of virions between proximal macrophages and dendritic cells [31]. Viral replication may not occur during cell-to-cell transmission, so regardless of the exact mechanism, shifts between cell-free spread and cell-to-cell spread may manifest themselves as concomitant changes in R 0 . However fundamental R 0 may be to describing the reproduction of pathogens, cell-to-cell spread exemplifies the difficulties in interpreting an estimate of R 0 biologically.
The route of infection determines the initial microenvironment of HIV. Most routes transmit HIV much less effectively than hematological routes [7], suggesting that their initial R 0 is typically low. Mucosal infection can promote transmission of HIV [34], however, because it can increase the local concentration of activated T cells, promoting cell-to-cell spread, and probably increasing the initial R 0 . Thus, the initial R 0 may provide valuable information about initial infection.
The dominance of cell-to-cell spread over cell-free spread may vary during infection. After infecting in a cell-rich mucosal microenvironment, HIV may move through the mucosal lamina, before being transported through the lymphatic system to lymph nodes, where the target cell density in its microenvironment increases dramatically [32,33]. The values of R 0 probably vary accordingly.
The present theory may also have an important application in animal trials of viral prophylaxis, when progress towards a therapy is subtle. Indeed, the design of animal trials using highdose challenges may have unintentionally impeded practical assessment of candidate HIV therapies, because some vaccines and prophylactics may mitigate low-but not high-dose challenges [6]. Repeated low-dose challenge studies represent an important step forward in the pre-clinical assessment, because they mimic typical HIV challenges in humans [3]. Repeated low-dose challenges probably yield infections with a single founder virus, satisfying the primary assumption of the present theory. An estimate of R 0 in this context provides a new variable for statistical analysis, beyond the binary infection status of an animal, one with direct biological relevance to the establishment of infection. Even if a vaccine or prophylactic fails to produce total sterilizing immunity, a reduction in the initial R 0 encourages further investigation of an intervention, where previously the entire line of research might have been discarded.
Trials using repeated low-dose challenges also pose some unanswered experimental questions, the most pressing being the possibility that unsuccessful challenges potentially perturb the challenged animals. Do unsuccessful challenges foster partial immunity to further challenge? Do they increase the probability of future infection? Subtle perturbations in R 0 may be much more sensitive than a binary infection status in providing the answers to such questions.
Next-generation deep sequencing can produce around 10 5 reads of size comparable to gp120 [35,36]. Given the expectation that future experiments will likely be able to generate datasets with potentially even greater than 10 5 sequences, any robust estimator of R 0 must be able to handle extremely large sample sizes efficiently. The consistent tightening of the error estimates and accuracy as M is increased in Eq (19) suggests that the estimatorr ðIÞ is particularly well-suited to application in such experiments. In addition, although a complete likelihood for the entire phylogeny and mutations might be desirable in some circumstances, a maximum likelihood method for the SFS permits inference about R 0 even if the reads are short, if the reads can be placed against a reference HIV genome.
In clinical trials of HIV therapies, guidelines previously suggested starting treatment only when CD4 + T cell density declines below 350 cells/μl [37]. Recent studies (notably the SPAR-TAC trial) of temporally earlier interventions have shown increased efficacy compared to standard anti-retroviral protocols [38,39]. In some circumstances, the initial R 0 might measure the clinical efficacy of some such interventions, provide a variable predictive of their efficacy, or even help predict the clinical intervention with the greatest chance of success.
In conclusion, in the case of HIV and possibly other infectious agents, the integral approximationr ðIÞ provides a simple, easily computed estimate of the early basic reproduction number R 0 in a single host. The quantitative variable R 0 makes a well-characterized biological contribution to early HIV infection and should be useful assessing the efficacy of therapies in both human and animal trials.