The Effect of Life History on Retroviral Genome Invasions

Endogenous retroviruses (ERV), or the remnants of past retroviral infections that are no longer active, are found in the genomes of most vertebrates, typically constituting approximately 10% of the genome. In some vertebrates, particularly in shorter-lived species like rodents, it is not unusual to find active endogenous retroviruses. In longer-lived species, including humans where substantial effort has been invested in searching for active ERVs, it is unusual to find them; to date none have been found in humans. Presumably the chance of detecting an active ERV infection is a function of the length of an ERV epidemic. Intuitively, given that ERVs or signatures of past ERV infections are passed from parents to offspring, we might expect to detect more active ERVs in species with longer generation times, as it should take more years for an infection to run its course in longer than in shorter lived species. This means the observation of more active ERV infections in shorter compared to longer-lived species is paradoxical. We explore this paradox using a modeling approach to investigate factors that influence ERV epidemic length. Our simple epidemiological model may explain why we find evidence of active ERV infections in shorter rather than longer-lived species.


Introduction
A significant proportion of host genomes is littered with the remnants of past retroviral infections. Termed endogenous retroviruses (ERVs), these remnants represent 8% of the genome in humans and over 10% of the genome in mice [1,2]. Infection by a retrovirus requires integration into the cellular DNA as part of its replication cycle. Integration into the germline cells and subsequent vertical transmission provides us with a genomic fossil record of multiple, independent, ancient retroviral infections, described from a wide range of vertebrate genomes, including mammals, fish, birds, reptiles and amphibians [3,4]. Typically, an ERV consists of three genes (gag, pol and env) and two flanking non-coding long terminal repeats (LTRs), which are identical at the time of integration. Over time, these retroviral insertions accumulate mutations and deletions at the mutation rate of the host genome [5], rendering them non-functional. ERVs may also be inactivated by recombinational deletion between the two flanking LTRs, which removes the internal coding region, leaving a solo LTR. Solo LTRs are 10-100 times more numerous than their full length counterparts [6], and many of these insertions are fixed in the host population. In various mammal species there are a few examples of intact, evolutionarily young ERVs that are polymorphic (present in some but not all individuals) in their host population [7-12]; whether this is persisting polymorphism maintained through various evolutionary forces or actual active ERVs remains to be established. However, some ERVs do appear to be intact and capable of expression and replication [11,13-17]; these are what we consider to be "active".
To date, no active ERVs have been discovered in humans. Most ERV research in humans is computationally based, comprising data mining of sequenced human genomes, which has revealed numerous polymorphic ERV insertions of the youngest known family of human ERVs, HERV-K(HML2) [7,18]. The most common insertions (those present at high frequency in the population) are present in the reference genome; it is unclear whether the newer insertions more recently identified are evidence of activity of this particular retroviral family, or lingering polymorphism. However, until a particular insertion in the human genome reaches a high frequency, it is unlikely to be detected unless one is specifically looking for it [7]. By the same token, any new retroviral invasions into the human genome would also be unlikely to be detected until they reached an appreciable frequency in the population. In contrast, the other mammalian genome that has been extensively studied is that of the mouse. The use of mice as a model organism has led to a great deal of experimental research, informing our understanding of the various roles that ERVs may play and revealing a number of active ERVs [2,12,16] ("active" in the sense that there are intact copies in the genome which are capable of expression and replication, and which in some cases have been shown to produce infectious viral particles). This has led to the conclusion that mice have more frequent ERV invasions than humans [19], but is this true? To ascertain the activity of ERVs in different species would require the same amount of effort that has been invested in research on mouse ERVs, which in most cases is unfeasible. A search through the literature shows that despite over 3500 endogenous retroviruses having been sequenced from 138 species of mammals, the majority of ERV research has focused on just a handful of species (Table 1; see Methods for further details).
An alternative approach is to model ERV dynamics, and explore factors that would influence our chance of detecting active ERVs, such as epidemic length (i.e. how long does it take for an endogenous retroviral insertion to reach a high enough frequency that it would be detected in the few whole genomes from a species that are sequenced?).
Intuitively, there are two factors that affect the likelihood of finding active ERVs: the rate of incorporation/loss-if it varies with life history, and how long they take to run their course (i.e. the time to fixation at a specific locus), which we call the length of the epidemic. Little is known about how the rate of incorporation/loss may vary across species with different life histories.
Here we explore how we would expect the length of an epidemic to vary across different life histories: where should we be looking to identify active ERVs? Given that organisms with a faster life history are i) short lived (shorter generation times), ii) tend to have larger numbers of offspring, and iii) have a larger effective population size, we might at first expect that a beneficial insertion (those ERVs that are involved in endogenous viral element derived immunity (EDI), and would therefore be considered beneficial) would spread to fixation faster in these species, than those with a slower life history [20]. Hence, we would expect an ERV epidemic to take longer (in years) in species with slower life histories and be more easily detectable at a given time. We explore how a search across different life histories may affect what we find.
Susceptible-Infectious-Recovered (SIR) models are a useful tool for inferring the length of an epidemic [21]. They have been used extensively to describe the dynamics of various infectious disease epidemics, including foot and mouth disease [22-24] and measles [25-27]. We have previously described an SIR model to investigate the circumstances under which a disease-causing retrovirus can become incorporated into the host genome and spread through the host population, if it confers an immunological advantage [28]. Compartmental models of this kind are now being used by others to investigate retroviral dynamics [29]. We use the model of Kanda et al [28] to explore which factors would influence the length of an epidemic of such an ERV (i.e. one involved in EDI), and how this varies with the life history of the host.

Methods
A keyword search on the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov/nuccore) for "endogenous retrovirus" AND "mammals" shows the number of endogenous retroviruses that have been described in mammals (3542 in 138 species-see S1 Table for accession numbers). Using PanTHERIA [30], we retrieved information on gestation length for 105 of these species. Species where information was not available in PanTHERIA, were excluded from this analysis. Gestation length has been shown to be a suitable indicator of speed of life history [31]. To assess the amount of "effort" focused on the ERVs of these species, we conducted a PubMed search using the "latin name" OR "common name" AND "endogenous retrovirus" as search terms (see S2 Table for PubMed search results). The number in the column headed "effort" represents the number of papers returned by PubMed. From the literature, we identified those species described as having active ERVs. This data is collated in Table 1 where species are ordered from faster life history to slow. From this table we examined the association between life history, measured as gestation length, and the number of active endogenous retroviruses reported having corrected for effort (measured as the number of papers on ERVs in the host species), using a generalized linear model (GLM) with Poisson distribution. This analysis, although relatively crude, will highlight evidence for any association between host life history and the number of active ERVs.
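As an illustration of the kind of analysis described above, the following is a minimal sketch of a Poisson regression of active-ERV counts on gestation length. It assumes that "correcting for effort" means entering the log of the number of papers as an offset; the text does not state how the correction was implemented, so this is one plausible reading rather than the authors' procedure. The function name `fit_poisson_offset` and all data values are hypothetical, and the fit uses a hand-written Newton-Raphson loop instead of a statistics package.

```python
import math

def fit_poisson_offset(x, y, effort, max_iter=100, tol=1e-10):
    """Fit log E[y] = b0 + b1 * x + log(effort) by Newton-Raphson.

    Treating log(effort) as an offset is an assumption of this sketch.
    """
    offset = [math.log(e) for e in effort]
    b0 = math.log(sum(y) / sum(effort))  # intercept-only fit as a start value
    b1 = 0.0

    def loglik(c0, c1):
        # Poisson log-likelihood, dropping the term that does not depend on c0, c1
        total = 0.0
        for xi, yi, oi in zip(x, y, offset):
            eta = c0 + c1 * xi + oi
            total += yi * eta - math.exp(eta)
        return total

    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi + oi) for xi, oi in zip(x, offset)]
        # Score vector (gradient of the log-likelihood)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Information matrix entries (negative Hessian)
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        # Halve the step if the full Newton step would lower the likelihood
        step = 1.0
        current = loglik(b0, b1)
        while step > 1e-8 and loglik(b0 + step * d0, b1 + step * d1) < current:
            step /= 2.0
        b0 += step * d0
        b1 += step * d1
        if abs(step * d0) + abs(step * d1) < tol:
            break
    return b0, b1

# Hypothetical illustration only: these are NOT the values in Table 1.
gestation_days = [20, 30, 60, 150, 280]
active_ervs = [6, 1, 1, 0, 0]
papers = [50, 10, 5, 20, 100]

intercept, slope = fit_poisson_offset(gestation_days, active_ervs, papers)
print("slope per day of gestation:", slope)
```

A negative slope in such a fit would correspond to fewer reported active ERVs per paper as gestation length increases.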

The Model
Kanda et al [28] developed a set of epidemiological models to investigate the conditions under which incorporation of retroviruses into the host genome benefits the host. Here, we use one of these models to examine the effects of life history on epidemic length. This is a type of compartmental model, standardly used to study the dynamics of a range of diseases [32,33]. We focus on the final model presented by Kanda et al [28] that realistically represented the dynamics of an EDI type ERV across a range of life histories. This model differs from a standard SIR model in that it includes 2 infected compartments, I_X and I_E (Fig. 1). The I_X compartment refers to individuals who are infected with the exogenous retrovirus; upon successful incorporation of the retrovirus into the germline of an individual, the offspring of these individuals enter the I_E compartment; these individuals have an endogenous copy of the retrovirus. The model also has 2 recovered compartments, R_N and R_LTR. Individuals in the R_N compartment have successfully dealt with the exogenous retroviral infection without incorporation of the retrovirus. The R_LTR compartment consists of individuals who have recovered from the retroviral infection with an endogenous copy of the retrovirus in the genome. The model is described mathematically by the equations below and represented graphically in Fig. 2. The equations describe the flow of individuals between compartments: where t represents time, τ is the birth rate (which does not differ between compartments), ϕ_S is the survival rate of susceptible individuals, γ is the rate of incorporation of the retrovirus, α is the rate of loss of the endogenous retrovirus (mutation or recombinational deletion), β_X and β_E are the infection rates of the exogenous and endogenous virus respectively, θ and θ' are the rates at which immunity is acquired to the exogenous and endogenous virus respectively, and ϕ_X ϕ_S and ϕ_E ϕ_S are the survival rates of the individuals infected by the exogenous and endogenous virus respectively. N(t) = S(t) + I_X(t) + I_E(t) + R_N(t) + R_LTR(t) is the total population size. Baseline values and a description of the parameters are summarised in Table 2.
Parameter values for the model
The survival (ϕ_S) and fertility (τ) rate parameters define the life history of a species in this model [34]. Large ϕ_S and small τ correspond to a species with a slow life history, while large τ and small ϕ_S correspond to a species with a faster life history. We constrain ϕ_S + τ = 1.016, as smaller values increase the probability of extinction, but above this value the extinction risk is virtually zero. We alter the life history of the species by changing the values of ϕ_S and τ such that their sum always equals 1.016. As τ gets larger the life history speeds up, and as ϕ_S increases the life history slows down. We vary ϕ_S and τ in increments of 0.01. For each life history we then independently vary values of all other parameters as described below.
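The life-history grid described above can be sketched as follows. The constraint (survival plus fertility equal to 1.016) and the 0.01 increment come from the text; the lower and upper bounds on survival (0.10 and 0.99) and the function name `life_history_grid` are assumptions made only for illustration, since the range actually explored is not stated here.

```python
TOTAL = 1.016  # constraint from the text: survival + fertility = 1.016

def life_history_grid(phi_min=0.10, phi_max=0.99, step=0.01):
    """Return (survival, fertility) pairs, ordered from fast to slow life history.

    phi_min and phi_max are illustrative bounds, not values from the paper.
    """
    n_points = int(round((phi_max - phi_min) / step)) + 1
    grid = []
    for i in range(n_points):
        phi_s = round(phi_min + i * step, 2)  # survival rate
        tau = round(TOTAL - phi_s, 3)         # fertility rate implied by the constraint
        grid.append((phi_s, tau))
    return grid

grid = life_history_grid()
print("fastest life history (low survival, high fertility):", grid[0])
print("slowest life history (high survival, low fertility):", grid[-1])
```

Each pair in the grid defines one life history for which the remaining parameters can then be varied.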
We model the invasion of an immunologically beneficial ERV by imposing a conservative mortality increase of 1 - ϕ_X = 0.03 on survival (new survival rate = ϕ_X ϕ_S) attributable to the exogenous virus. For the endogenous virus, we impose a mortality increase on survival of 1 - ϕ_E = 0.015 (new survival rate = ϕ_E ϕ_S). ϕ_E and ϕ_X are reductions in ϕ_S (survival). We assume that the mortality rate of the I_E compartment is less than that of the I_X compartment. Similarly, we assume the infection rates of the exogenous virus (β_X) and the endogenous virus (β_E) are also likely to differ, with the endogenous virus being less infectious than the exogenous. Baseline values are set at β_X = 0.5 and β_E = 0.4. We explore how varying these values (β_X and β_E) in increments of 0.05 affects the epidemic length across different life histories.
With the exception of a few HERVs, there is little information available regarding the rate of incorporation of the retrovirus (γ) and the rate of loss (α) of the virus (mutation or recombinational deletion). The estimates that are available for humans suggest that α < γ (see [28] for further details). Our baseline values are set at α = 0.00001 and γ = 0.0001. We explore how varying these two parameters affects the length of the epidemic (from 0.0001 to 0.001). The rates at which immunity would arise to the exogenous virus (θ) and the endogenous virus (θ') are unknown and so we set the baseline values for these parameters at θ = 0.05 and θ' = 0.01; we have previously shown the absolute parameter values of θ, θ' and α are relatively unimportant, as it is their relative values that determine the dynamics [28]. As the values of θ, θ' and α approach zero, the epidemic lasts longer and simulations need to be run for longer until the asymptotic equilibrium is reached. We vary θ and θ' from 0.0001 to 0.1 in 10 increments to determine the influence of immunity on the length of an epidemic.

[Figure caption, partial] Integration of the retrovirus into the germline of these I_X individuals results in the offspring of these individuals entering the I_E compartment; these individuals are infected and infectious, and have an endogenous copy of the retrovirus in their genome. Individuals in the R_N compartment have successfully dealt with the infection, without incorporating the virus into their genome; they are the same as the S individuals. Individuals in the R_LTR compartment have also successfully dealt with the infection, but they are left with a copy of the integration in their genome (illustrated here as the LTR (green), but may also be a full length provirus (red and green as in the virus)).

[Fig. 2 caption] Graphical representation of the S I_X I_E R_N R_LTR model. The circles represent compartments; the arrows represent transition rates between the compartments. Upon infection with a retrovirus, susceptible individuals (S compartment) transition to the I_X compartment. If immunity is easily acquired to the retroviral infection, individuals enter the R_N compartment; they have recovered without integration of the retrovirus, and the offspring of these individuals enter the susceptible compartment. If it is difficult for immunity to arise to the retrovirus, then integration of the retrovirus into the germline of these I_X individuals may occur, and the offspring of these individuals enter the I_E compartment. After the threat of the retroviral infection has passed, the integration is free to be lost to recombinational deletion (or mutation), and individuals enter the R_LTR compartment.

Conducting the simulation
We ran simulations of the model until the proportion of the population in each class was stable to a tolerance of 0.000001, and recorded the length of time until equilibrium. We start with one I_X individual and the rest of the population in the S compartment. The simulation was run for a maximum of 50000 iterations to allow the population to reach a stable equilibrium, and we calculated the number of generations this had taken. Generation length (T_C) was calculated using the equation below [35]: where a represents age, l_a (or ϕ_S^a) is the survivorship from birth to age a, and m_a (or τ) is the fertility rate at age a. Baseline values for the various parameters are described above. All simulations were conducted in R version 2.15.2 [36].
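To make the generation-length calculation concrete, the sketch below assumes that the measure referred to is the standard cohort generation time, T_C = (sum over a of a * l_a * m_a) / (sum over a of l_a * m_a), with survivorship l_a equal to the survival rate raised to the power a and a constant fertility m_a. It further assumes that ages are counted from 1 and that the sum can be truncated at a large maximum age; the function name `generation_length` and these choices are illustrative, not taken from the original analysis. Under these assumptions the sum has the closed form 1 / (1 - survival), so generation length depends on survival alone and grows as survival increases.

```python
def generation_length(phi_s, tau, max_age=5000):
    """Cohort generation time with l_a = phi_s ** a and constant fertility m_a = tau.

    Assumes phi_s < 1, ages starting at 1, and truncation at max_age.
    """
    numerator = 0.0
    denominator = 0.0
    for age in range(1, max_age + 1):
        l_a = phi_s ** age  # survivorship from birth to this age
        m_a = tau           # fertility at this age (constant in this sketch)
        numerator += age * l_a * m_a
        denominator += l_a * m_a
    return numerator / denominator

TOTAL = 1.016  # survival + fertility constraint used in the text
for phi_s in (0.5, 0.7, 0.9):
    tau = TOTAL - phi_s
    print(phi_s, round(tau, 3), generation_length(phi_s, tau))
```

Dividing the number of iterations taken to reach equilibrium by this quantity gives the epidemic length in generations.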

Sensitivity of epidemic length to transition rates
Because we are interested in the effect of the range of parameter values on the length of the epidemic across life histories, we wish to examine how varying the model parameters impacts the time taken for the population to reach equilibrium. We do this by systematically altering the values of each model parameter in 10 increments (between the ranges described for each parameter above), and re-running the simulation for different life histories (differing values of ϕ_S and τ).
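The stopping rule and parameter sweep described above can be sketched generically. The helper below iterates a model until the class proportions change by less than the tolerance (0.000001, with a cap of 50000 iterations, as in the text) and then repeats this over 10 values of one parameter. The two-class `make_toy_step` model is a hypothetical stand-in used only to make the example runnable; it is not the ERV model of Kanda et al [28]. Carrying proportions forward between iterations, which avoids numerical overflow in an exponentially growing population, is also an assumption and is only valid for density-independent (linear) models.

```python
def normalise(state):
    """Convert class sizes to proportions of the total population."""
    total = sum(state)
    return [value / total for value in state]

def iterations_to_equilibrium(step, state, tol=1e-6, max_iter=50000):
    """Iterate until every class proportion changes by less than tol.

    Returns the number of iterations taken, or None if max_iter is reached.
    """
    props = normalise(state)
    for n in range(1, max_iter + 1):
        new_props = normalise(step(props))
        if max(abs(a - b) for a, b in zip(props, new_props)) < tol:
            return n
        props = new_props
    return None

def make_toy_step(rate, growth=1.016):
    """Hypothetical two-class model: a fraction `rate` of class 0 moves to class 1."""
    def step(state):
        unmarked, marked = state
        return [growth * (1.0 - rate) * unmarked,
                growth * (marked + rate * unmarked)]
    return step

def sweep(values, initial=(1.0, 0.0)):
    """Re-run the simulation once for each value of the varied parameter."""
    return {v: iterations_to_equilibrium(make_toy_step(v), list(initial))
            for v in values}

rates = [round(0.05 * k, 2) for k in range(1, 11)]  # 10 increments
for rate, n_iter in sweep(rates).items():
    print(rate, n_iter)
```

In the full analysis the same loop would be nested inside a loop over life histories, so that each parameter value is evaluated for every combination of survival and fertility.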

Results
Analysis of data in Table 1 revealed a negative association between host gestation length and the number of active ERVs identified in the host species, corrected for effort (slope = -0.0264, s.d. = 0.0005, p < 0.001). Fig. 3 illustrates the effect of life history on the number of active ERVs we see, as predicted by the GLM (line) compared to the actual data (crosses). The number of active ERVs increases significantly with the speed of host life history (as approximated by gestation length). In species where data are available, active ERVs are significantly more likely to be found in species with a faster life history. This contrasts with our original intuitive expectation that ERV epidemics should run their course faster in fast-lived species, and that we should see fewer active ERVs in fast-lived species than slow-lived species.
The relationship between the survival rate (ϕ_S) and fertility rate (τ) is illustrated in Fig. 4. Given that mean survival and mean fertility are constrained, generation length increases as mean survival increases. Generation length, by its definition, is a function of ϕ_S and τ, and is unaffected by the other parameters (equation 2). The effects on the epidemic length of varying the other transition parameters are illustrated in Fig. 5. The first striking observation is that the length of the epidemic is more strongly correlated with generation length than with anything else, something that is also suggested by data available from the literature (Fig. 3). The length of the epidemic is greater in species with a fast life history (short lived) than in those with slow life histories, regardless of the parameters that describe infection rates, incorporation rate or rate of loss. For an epidemic to last, there must be more individuals in the susceptible compartment, and in fast-lived species susceptible individuals are generated faster. Loss and integration can only occur during reproduction. As τ (birth rate) increases, the rate of flow of individuals into the I_X compartment increases much more than the rate of flow of individuals to the R_LTR compartment i.e. α > θ'α and it increases at a rate of 1 y; hence the epidemic takes longer to run its course in species with a fast life history.
Secondly, we observe that when it is easy to acquire immunity to the exogenous retrovirus (higher values of θ), individuals spend more time moving around the left hand side of Fig. 2 (between the S, I_X and R_N compartments), and the epidemic lasts longer (Fig. 5 I and J). If immunity to the exogenous virus is easy to acquire, individuals never progress to the I_E compartment and the epidemic never progresses. This effect is less pronounced in species with longer life histories. At θ > 0.07, the population fails to reach a stable equilibrium in 50000 years [28].
The effect of most of the other parameters is negligible, with the exception of the rate of loss, α (Fig. 5 A and B), for species with a faster life history. As α increases, more individuals transition from the I_E compartment to the I_X compartment. This flow of individuals towards the left hand side of Fig. 2 results in the epidemic taking longer.

[Fig. 3 caption, partial] The black line represents the GLM predictions, whereas the crosses (blue and red) represent the actual data (from Table 1). Crosses in red indicate a high amount of "effort" (>10).

Discussion
Our use of SIR models to study the multigenerational dynamics of an endogenous retroviral infection is novel, combining methods from demography [35] with epidemiological methods to address the question of ERV epidemic length. Because our model is density independent, the population size will increase exponentially, but population growth rate and structure will converge on a stable equilibrium. There is a growing body of evidence to suggest that for some ERVs, incorporation into the genome provides the host with some immunity from related exogenous retroviruses (described as endogenous viral element derived immunity, EDI [37]), through a variety of different mechanisms [10,38-41]. Under this scenario, there is clearly an advantage to incorporation of an exogenous virus into the genome. However, despite the advantages of incorporation of retroviruses into the genome, there are also downsides to consider.
In many species there are numerous examples of ERVs causing disease [10,42,43]. In humans alone, HERVs have been associated with cancer [44-46], multiple sclerosis [47-49], and a whole host of other diseases (see [50] for review). Additionally, we also have to consider the possibility that the reverse of endogenisation may occur, with exogenous viruses emerging from active ERV lineages. The reservoir of viruses present in host genomes may also be able to recombine with exogenous retroviruses, resulting in novel recombinant viruses that may be pathogenic, a phenomenon that has already been observed in cats [51]. It has been shown that the standard mechanism of ERV replication within a genome involves reinfection of germline cells, and hence possibly movement between host individuals [52,53]. Subsequently, cross-species transmission of these viruses is a very real concern; there are numerous examples identified to date where cross-species transmission of retroviruses can lead to emergent disease, e.g. HIV was certainly acquired from the non-human primate version of the virus, SIV, which has crossed the species barrier on multiple occasions from chimpanzees and sooty mangabeys [54]. SIV has also been found to be endogenous in a species of lemur [39]. Other recent cross-species transmission events include the introduction of the koala retrovirus (KoRV), which is suspected to have originated from the exogenous Gibbon ape Leukemia Virus (GaLV) [55]. There are also several cases of close evolutionary relationships between exogenous retroviruses and ERVs in the same host species, e.g. in the sheep, cat, chicken and mouse [13,56]. Interpreting ERV diversity remains challenging and a better understanding of where we are likely to find active ERVs, and consequently possible threats of emerging disease, is clearly important in informing the direction of research in this area.
Where and when would we expect to find active ERVs?
Our model suggests that the reason why more active ERVs are discovered in species with a fast life history (such as mice and koalas) than in those with a slow life history is not necessarily that these species have more frequent ERV epidemics, but that those epidemics last longer and are therefore more likely to be detected. For the subset of ERVs involved in EDI, the life history of the host has the greatest bearing on the length of an epidemic. The next most influential factor is the rate at which immunity arises to the exogenous retroviral infection (θ). There is a greater rate of generation of susceptible individuals in faster life histories than slower, resulting in a longer time taken for the majority of the population to reach the R_LTR compartment. Not all ERVs will provide an immunity advantage. Our model applies to that subset of ERVs that are involved in EDI. The majority of active ERVs that have been described are in species that do have a faster life history, such as mice, koalas and sheep [10,16,42], which is in line with our prediction that these epidemics last longer in species with a fast life history, and are therefore more easily detectable. The unusually high number of active ERVs identified in mice may indeed be an anomaly; the numerous studies of ERVs in mice have focused on laboratory strains and it is possible that inbreeding and selection of certain characteristics of this model organism may have unintentionally contributed to the high levels of ERV activity observed in this particular species. For example, the first inbred mouse strain, DBA, was bred for its coat colour, which has been shown to be the result of an ERV insertion [57]. However, until an equal amount of "effort" is invested into other species with similar life histories (or wild mouse populations), it is difficult to ascertain whether mice are simply more susceptible to ERV infections.
Nonetheless, more active ERVs are described in species with a fast life history. Table 1 illustrates the number of active ERVs that have been identified in all mammals in which ERVs have been described. There is a strong correlation between the number of active ERVs and the life history of the host (as estimated from gestation length), when weighted for effort, supporting our finding that we are more likely to find active ERVs in shorter lived species than in longer lived species (Fig. 3). Previously, these observations have been attributed to a higher level of activity of ERVs in these species (particularly in mice) [58-60]. In our model, this would correspond to higher values of infection rates (β_X and β_E) and incorporation (γ), which, interestingly, do not appear to have a considerable effect on the length of an epidemic. In light of these results, it is also worth noting that the proportion of the genome derived from ERVs in mice (short life history) and humans (long life history) is fairly similar: over 10% and 8% respectively [1,2]. If ERV activity were greater in species with short life histories, then we should expect more of their genomes to originate from ERV insertions. The implication of these results is that in species with a slow life history (such as humans), we should not expect to easily find active ERVs, as the epidemic occurs quickly.
Existing data on species where active ERVs have been discovered are consistent with the results of our model. However, we acknowledge that this could also be a bias in the available data, as more research in this respect has been conducted on species with faster life histories, as demonstrated in Table 1. The model we have described does not account for specific mechanisms of EDI; a better understanding regarding the mechanisms behind how retroviral immunity (EDI) is gained, which may vary with life history, would be valuable in refining this model and allowing us to better target the search for active ERVs. Further studies of a range of species, with more active ERVs and contrasting life histories, will enable a better estimation of the parameters. However, our model suggests efforts to identify active ERVs should be focused on species with faster life histories, as this is where we stand a better chance of discovering active ERVs and potential threats of new emerging infections.
Supporting Information
S1 Table. Accession numbers for the 3542 endogenous retroviruses identified in the NCBI nucleotide database. (DOCX)
S2 Table. PubMed search results to assess the effort focused on the ERVs of a particular species. (XLSX)