High Heritability Is Compatible with the Broad Distribution of Set Point Viral Load in HIV Carriers

Set point viral load in HIV patients ranges over several orders of magnitude and is a key determinant of disease progression in HIV. A number of recent studies have reported high heritability of set point viral load implying that viral genetic factors contribute substantially to the overall variation in viral load. The high heritability is surprising given the diversity of host factors associated with controlling viral infection. Here we develop an analytical model that describes the temporal changes of the distribution of set point viral load as a function of heritability. This model shows that high heritability is the most parsimonious explanation for the observed variance of set point viral load. Our results thus not only reinforce the credibility of previous estimates of heritability but also shed new light onto mechanisms of viral pathogenesis.


A Distribution of genotype and phenotype in the different populations along a full replication cycle
In this section we will derive expressions for the distributions of genotypes g, environment e and phenotype φ in the populations of carriers C, donors D, recipients R and new carriers E.
The phenotype φ refers here to the log set point virus load (log spVL). The genotype g refers to 5 the virus and the environment e refers to all non-transmissible contribution to log spVL, i.e. the contributions from the host genotype, from the interactions between host and viral genotypes and from the environment. Generally, p x,Y will denote the distribution of x ∈ {g, e, φ} in the population Y ∈ {C, D, R, E}. The phenotype φ(g, e) is a function of the genotype g and the environment e. The simplest assumption is that g and e contribute additively, φ(g, e) = g + e. (A1)

A.1 Carrier population
Let the joint distribution of genotypes and environments in the carrier population be p ge,C (g, e).
Assuming that genotypes and environments are independently distributed we have, p ge,C (g, e) = p g,C (g)p e,C (e).
Here, δ is the Dirac-delta function and the asterisk denotes the convolution of the distributions p g,C and p e,C .

A.2 Donor population
Donors are selected from the current distribution of carriers according to their fitness S(φ) which depends on their phenotype φ = g + e. The joint distribution of g and e in selected donors is, p ge,D (g, e) = 1 Z s p ge,C (g, e)S(g + e) = 1 Z s p g,C (g)p e,C (e)S(g + e), where Z s is a normalization constant, Z s = p ge,C (g, e)S(g + e)dgde = p g,C (g)p e,C (e)S(g + e)dedg (A8) = p g,C (g)p e,C (φ − g)S(φ)dφdg (A9) We can then write the joint distribution of genotypes g and phenotypes φ in the selected donors, p gφ,D (g, φ) = p ge,D (g, e)δ(φ − (g + e))de (A11) The distribution of genotypes irrespective of the phenotype then is p g,D (g, φ) marginalized over 20 φ, Similarly, the distribution of the phenotype φ in the selected donors is,

A.3 Recipient population
The distribution of genotypes in the recipient population is shaped by the transmission function T (g R , g D ), which determines the genotype g R of a recipient given that the genotype of the donor was g D . So the distribution of g in the recipients is T integrated over all genotypes in the donor population, We can write the distribution of phenotype in the recipient population as, Inserting equation (A14),

A.4 Evolved population (after intrahost evolution)
25 Let E g (g E , g R ) be the function that evolves the genotype within the host. The distribution of genotypes in the evolved recipients is then, Inserting equation (A17), Due to the evolution of the virus genetics, the host-virus interactions can change. This would result in a change in the distribution of e in the evolved population. Let E e (e E , e R ) be the 30 function that evolves the interactions within the host. The distribution of environmental factors in the evolved recipients is then, We can write the distributions of phenotypes as,

B Analytical solution assuming normal distributions
While the above expressions hold for any distribution, the integral cannot be solved in the general case. If we assume normal distributions for all the different processes, we are able to 35 derive closed-form expressions.

B.1 Carriers
We assume that the distributions p g,C and p e,C are normally distributed, Here, (m C , v C ) and (µ e , ν e ) are the means and variances of the genotype and environmental distributions respectively.
Since the convolution of two Gaussian distributions with means µ 1 and µ 2 and variances σ 2 1 40 and σ 2 2 is also a Gaussian with mean µ 12 = µ 1 + µ 2 and variance σ 2 12 = σ 2 1 + σ 2 2 , the distribution of phenotypes in the carrier population p φ,C is also normal with mean, and variance,

B.2 Selected donors
Additionally, the product of two Gaussians is also a Gaussian (not necessarily normalized) with mean, and variance, Thus p e ≡ p e,C is symmetric around the mean µ e such that, and equation (A14) becomes, The convolution of p e and S has mean µ e + µ o and variance ν e + ν o . If we write, then A is a Gaussian with variance ν e + ν o and mean µ e + µ o − 2µ e = µ o − µ e . From the product formula above, p g,D ∼ N (m D , v D ), The distribution of phenotypes in the donor population follows from equation (A15) directly.

B.3 Recipients
The transmission function T determines the viral genotype of the recipient, given that the geno-50 type of the donor was g R . We assume that T is normally distributed around g R with variance ν t . Thus equation (A17) becomes where p t is a Gaussian with zero mean and variance ν t . This integral is again a convolution, Equivalently for the phenotype distribution in the recipients, from equation (A21), where p t+e is a Gaussian with mean µ 0 e and variance v t + ν 0 e . Thus the convolution is again Gaussian and p φ,

B.4 New carriers
The same as for transmission, we assume that the evolver functions for the viral and environ-55 mental contribution is E g ∼ N (g R + µ i , ν g i ) and E e ∼ N (e R + µ i e , ν i e ), respectively. The evolved population of new carriers has a genotype distribution given by equation (A23).
where p E has mean zero and variance ν g i , such that p g,E ∼ N (m C , v C ), The distribution of phenotypes in the evolved population as a function of the distribution in the recipient population is, Let p Ee (x) be a normal distribution with mean zero and variance ν i e , with f 1 a normal distribution with mean µ 0 e + µ i e and variance ν 0 e + ν i e . Integrating the convolutions further, where f 2 is a normal distribution with mean m R + µ i and variance v R + ν g i .

60
The distribution of the phenotype follows from the convolution of f 1 and f 2 , such that p φ,E ∼ N (M C , V C ),

C Equilibrium solutions for mean and variance of spVL
Concerning log spVL under the assumption of normal distributions, we have the following expressions for the distribution of log spVL in the current carriers and the carriers in the following generation, The system is said to be in equilibrium when the distribution in phenotype not longer changes from one generation to the next, thus, From equation (C4) we readily find the equilibrium solution for v C , The equilibrium solution of m C as a function of v C is then, If we assume that at equilibrium, the distributions of environmental factors no longer change from one generation of carriers to the next, then, where the prime signifies the values of mean and variance of environmental factors in the new generation of carriers. Thus the equilibrium solutions for the phenotype distribution are, We can express the equilibrium solutions in terms of the heritability h 2 , where Inserting into equation (C10), By rearranging the terms we get, Squaring both sides yield the quadratic equation, that has the solutions,Ṽ Keeping only the non-negative solution and inserting equation (C11) in the expression forM C ,

C.1 Equilibrium of environmental factors
In the main text we argue that there is good evidence that the phenotypic distribution of spVL is approximately in equilibrium, and thus M C = M C and V C = V C . In the above derivation, we assume that this also implies an equilibrium of the environmental factors, It is straightforward to see that if both the distributions for g and e are in equilibrium, then the distribution for φ is also in equilibrium. There are, however, certain special cases that can be considered where an equilibrium of φ does not imply an equilibrium of g and e. Firstly, the distribution of φ might converge faster to an equilibrium value than the distributions of g and e. This would imply that the contributions of the virus and the environment to the variance in spVL might still be changing over time. Consequently, heritability may also still 75 be changing over time. Secondly, the contributions of g and e may be diverging in opposite directions, such that the change in the distribution of g cancels out the change in the distribution of e on the population level. This scenario, however, is unlikely as it requires the viral and host/environmental factors that influence spVL to increase or decrease indefinitely. Thirdly, the change in g and e on the population level is described by a stable limit cycle, such that the 80 distribution in spVL in the population is constant through time, φ(t) = g(t) + e(t) ≡φ. While stable limit cycles can appear in theoretical models, they are rarely observed in real complex biological systems, due to the delicate balance required between the variables. Furthermore, this balance has to be maintained on a population level, which would require some sort of synchrony between the evolutionary changes happening in each individual host. We therefore 85 argue that it is most conceivable that the equilibrium of spVL in the population also implies an equilibrium of the distribution of viral and environmental effects.

D Connection to integral projection models
Our description of the distributions of log spVL change over generations has strong parallels to integral projection models used in ecology to describe how the composition of population with 90 continuous traits changes over discrete time (2)(3)(4). In this formalism, the number of individuals with trait y in generation t + 1 is given by (2), n(y, t + 1) = Ω k(y, x)n(x, t)dx.
Here, k(y, x) is called the kernel and defines the number of offspring with trait y produced by an offspring of trait x in generation t. Heritability can be viewed as the regression of offspring on parents, i.e. new carriers on old 95 carriers. As we assume the distribution of log spVL in carriers to be normal, the conditional distribution of log spVL in new carriers given an log spVL current carriers is, where ρ = V C /V C h 2 is the correlation coefficient between carriers in subsequent generations. Thus the projection kernel k(φ C , φ C ) = p(φ C |φ C ).

100
We extracted the viral load measurements from the pdf file of Geskus et al. (1) to provide a further estimate of mean and variance of viral load. This study is also based on the Amsterdam cohort, but the patient population is not identical to the one used in Fraser et al. (5). Excluding measurements that were under the detection limit we estimate a mean of 4.22 logs with a variance of 0.59. The fitted line in figure S1 shows that the distribution is well approximated by 105 a normal distribution, although a statistical test reveals a significant deviation from normality. Note that the viral loads reported in Geskus et al. (1) are not spVLs, but include also repeated measurements from individual patients. As a consequence the sample variance is likely an overestimate of the real variance of spVL.

F.1 Exact transmission potential
In this section we assess to what degree the normal approximation to the transmission potential (TP) results in a distribution of spVL in HIV carriers that is different from using the TP as reported in Fraser et al. (5). We also account for uncertainty in the transmission potential by accounting for the confidence intervals in the reported TP. To this end we simulate 20 repro- The simulated distribution of spVL in carriers after 20 cycles is shown in Figure S2. The normal approximation is in very good agreement with the simulated distribution.

F.2 Skewness in intrahost evolution and transmission bottleneck
To test the effect of deviations from normality of the processes of intrahost evolution and the 135 transmission bottleneck we sampled from a skew-normal instead of a normal distribution for both processes. The skew-normal distribution is characterized by a location, a shape and a scale parameter that together define mean, variance and skewness of the distribution. If the shape parameter is zero, the distribution has no skewness and reduces to normal distribution. To sample from the skew-normal distribution we used the rsnorm function of the VGAM pack-140 age in R (6). Figure S3 shows the effect of skewness in processes of intrahost evolution and the transmission bottleneck mean, variance and skewness of the spVL distribution in the carrier population by varying skewness in both processes from -0.9 to 0.9. The key result is that the analytical results for mean and variance of the spvL distribution remain excellent approximations even for strong skewness in the processes of intrahost evolution and the transmission

F.3 Influence of the acute and AIDS phase on the transmission potential
One concern regarding the transmission potential from Fraser et al. (5) is that it neglects transmission from the acute and the AIDS phase of the infection. This is addressed in more detail in the supplementary material of Fraser et al. (5). As described therein the required correction de-pends on the assumed model of sexual mixing and partner exchange rate. One way to account for the contribution of these phases is to add a constant term to the transmission potential. This term was estimated in Fraser et al. (5)  We performed simulations to compare the equilibrium mean and variance for a transmission potential with a constant c to the analytical expression obtained assuming c = 0 (see figure  S4). The simulations show that both mean and variance increase with increasing c. Adding a constant to the transmission potential results in overall weaker selection for viral load. This leads to a general increase in variance. The mean increases because the transmission potential 160 is weaker in opposing the force of intrahost evolution towards higher spVL. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Furthermore, we tested the effect a corrected transmission potential by repeating the rejection sampling procedure using c = 1.2 (see figure S5). Using a corrected transmission potential generally narrows down the acceptable parameter ranges (because of the effect of increasing variance and mean shown in figure S4). The areas of highest posterior probability remain in 165 regions of high heritability. Thus, in summary, modifying the transmission potential to account for the contributions of the acute and AIDS phase does not change the two key conclusions, namely that high heritability is the most parsimonious explanation for the observed mean and variance of spVL and that the forces of intrahost evolution must be weak.  figure 3 in the main text. Since no analytical solutions are available for the modified transmission potential we performed simulations to measure the approximate equilibrium mean and variance. Because of the higher computational demands we sampled 40 000 random sets of parameter values from these restricted priors: 0 < ν e < 0.6; 0 < µ i < 0.3; 0 < ν i , ν t < 0.15. For comparison, however, we plot the accepted parameters over the same range as in figure 3 in the main text.