Modern-day SIV viral diversity generated by extensive recombination and cross-species transmission

Cross-species transmission (CST) has led to many devastating epidemics, but is still a poorly understood phenomenon. HIV-1 and HIV-2 (human immunodeficiency virus 1 and 2), which have collectively caused over 35 million deaths, are the result of multiple CSTs from chimpanzees, gorillas, and sooty mangabeys. While the immediate history of HIV is known, there are over 45 lentiviruses that infect specific species of primates, and patterns of host switching are not well characterized. We thus took a phylogenetic approach to better understand the natural history of SIV recombination and CST. We modeled host species as a discrete character trait on the viral phylogeny and inferred historical host switches and the pairwise transmission rates between each pair of 24 primate hosts. We identify 14 novel, well-supported, ancient cross-species transmission events. We also find that lentiviral lineages vary widely in their ability to infect new host species: SIVcol (from colobus monkeys) is evolutionarily isolated, while SIVagms (from African green monkeys) frequently move between host subspecies. We also examine the origins of SIVcpz (the predecessor of HIV-1) in greater detail than previous studies, and find that there are still large portions of the genome with unknown origins. Observed patterns of CST are likely driven by a combination of ecological circumstance and innate immune factors.


Introduction
As demonstrated by the recent epidemics of EBOV and MERS, and by the global HIV pandemic, viral cross-species transmissions (CST) can be devastating [1,2]. As such, understanding the propensity and ability of viral pathogens to cross the species barrier is of vital public health importance. Of particular interest are transmissions that not only "spillover" into a single individual of a new host species, but that result in a virus actually establishing a sustained chain of transmission and becoming endemic in the new host population ("host switching") [3].
HIV is the product of not just one successful host switch, but a long chain of host switch events [4,5]. There are two human immunodeficiency viruses, HIV-2 and HIV-1. HIV-2 arose from multiple cross-species transmissions of SIVsmm (simian immunodeficiency virus, sooty mangabey) from sooty mangabeys to humans [6][7][8]. HIV-1 is the result of four independent cross-species transmissions from chimpanzees and gorillas. Specifically, SIVcpz was transmitted directly from chimpanzees to humans twice; one of these transmissions generated HIV-1 group M, which is the primary cause of the human pandemic [9]. SIVcpz was also transmitted once to gorillas, generating SIVgor [10], which was in turn transmitted twice to humans [11].
Looking further back, SIVcpz itself was also generated by lentiviral host switching and recombination. Based on the SIV sequences available at the time, early studies identified SIVmon/-mus/-gsn (which infect mona, mustached, and greater spot-nosed monkeys, respectively) and SIVrcm (which infects red-capped mangabeys) as probable donors [12]. Functional analysis of accessory genes from these putative parental lineages indicate that the specific donors and genomic locations of these recombination event(s) were crucial for enabling what became SIVcpz to cross the high species barrier and establish an endemic lineage in hominids [13].
The complex evolutionary history resulting in HIV illustrates the importance of natural history to modern day viral diversity, and although the history leading to HIV is well detailed, broader questions regarding cross-species transmission of primate lentiviruses remain [14]. With over 45 known extant primate lentiviruses, each of which is endemic to a specific host species [4,5,15], we sought to characterize the history of viral transmission between these species to the degree possible given the precision allowed from a limited sample of modern-day viruses.
Here, we utilized phylogenetic inference to reconstruct the evolutionary history of primate lentiviral recombination and cross-species transmission. We assembled datasets from publicly available lentiviral genome sequences and conducted discrete trait analyses to infer rates of transmission between primate hosts. We find evidence for extensive interlineage recombination and identify many novel host switches that occurred during the evolutionary history of lentiviruses. We also find that specific lentiviral lineages exhibit a broad range of abilities to cross the species barrier. Finally, we also examined the origins of each region of the SIVcpz genome in greater detail than previous studies to yield a more nuanced understanding of its origins.

Results
There have been at least 13 interlineage recombination events that confuse the lentiviral phylogeny In order to reconstruct the lentiviral phylogeny, we first had to address the issue of recombination, which is frequent among lentiviruses [16]. In the context of studying cross-species transmission, this is both a challenge and a valuable tool. Evidence of recombination between viral lineages endemic to different hosts is also evidence that at one point in time, viruses from those two lineages were in the same animal (i.e., a cross-species transmission event must have occurred in order to generate the observed recombinant virus). However, this process also results in portions of the viral genome having independent evolutionary-and phylogenetichistories.
To address the reticulate evolutionary history of SIVs, we set out to identify the extent and nature of recombination between lentiviral lineages. Extensive sequence divergence between lineages masks site-based methods for linkage estimation (S1 Fig). However, topology-based measures of recombination allow for "borrowing" of information across nearby sites, and are effective for this dataset. We thus utilized a phylogenetic model to group segments of shared ancestry separated by recombination breakpoints instantiated in the HyPhy package GARD [17]. For this analysis, we used a version of the SIV compendium alignment from the Los Alamos National Lab (LANL), modified slightly to reduce the overrepresentation of HIV sequences (N = 64, see Methods) [18]. Importantly, because each virus lineage has only a few sequences present in this alignment, these inferences refer to inter-lineage recombination, and not the rampant intra-lineage recombination common among lentiviruses.
GARD identified 13 locations along the genome that had strong evidence of inter-lineage recombination (Fig 1). Here, evidence for a particular model is assessed via Akaike Information Criterion (AIC) and differences in AIC between models indicate log probabilities, so that a delta-AIC of 10 between two models would indicate that one model is e 10/2 =~148 as likely as the other [19]. In our case, delta-AIC values ranged from 154 to 436 for each included breakpoint, indicating that these breakpoints are strongly supported by the underlying tree likelihoods. The 14 resulting segments ranged in length from 351 to 2316 bases; in order to build reliable phylogenies, we omitted two of the less supported breakpoints from downstream analyses, yielding 12 segments ranging in length from 606-2316 bases. We found no evidence to suggest linkage between non-neighboring segments (S2 Fig). While it has been previously appreciated that several lineages of SIV are recombinant products (e.g., [12,20]), the 13 breakpoints identified here provide evidence that there have been at least 13 inter-lineage recombination events during the evolution of SIVs. Identifying these recombination breakpoints allowed us to construct a putatively valid phylogeny for each segment of the genome that shares an internally cohesive evolutionary history.

Most primate lentiviruses were acquired by cross-species transmissions
We then looked for phylogenetic evidence of cross-species transmission in the tree topologies of each of the 12 genomic segments. For this and all further analyses, we constructed a dataset from all publicly available primate lentivirus sequences, which we curated and subsampled by host and virus lineage to ensure an equitable distribution of data (see Methods). This primary dataset consists of virus sequences from the 24 primate hosts with sufficient data available (5-25 sequences per viral lineage, N = 423, S3 Fig). Alignments used the fixed compendium alignment as a template (see Methods).
In phylogenetic trees of viral sequences, cross-species transmission appears as a mismatch between the host of a virus and the host of that virus's ancestor. To identify this pattern and estimate how frequently each pair of hosts has exchanged lentiviruses, we used the established methods for modeling evolution of discrete traits, as implemented in BEAST [21,22]. In our case, the host of each viral sample was modeled as a discrete trait. This is analogous to treating the "host state" of each viral sample as an extra column in an alignment, and inferring the rate of transition between all pairs of host states along with inferring ancestral host states across the phylogeny. This approach is similar to common phylogeographic approaches that model movement of viruses across discrete spatial regions [23] and has previously been applied to modeling discrete host state in the case of rabies virus [21]. Here, we took a fully Bayesian approach and sought the posterior distribution across phylogenetic trees, host transition rates and ancestral host states. We integrate across model parameters using Markov chain Monte Carlo (MCMC). The resulting model provides phylogenetic trees for each segment annotated with ancestral host states alongside inferred transition rates. Fig 2 shows reconstructed phylogenies for 3 segments along with inferred ancestral host states. Trees are color coded by known host state at the tips, and inferred host state at internal nodes/branches; color saturation indicates the level of certainty for each ancestral host assignment. A visual example of how the model identifies cross-species transmissions can be seen in the SIVmon/SIVtal clade, which infect mona-and talapoin monkeys, respectively (starred in Fig 2A-2C). Due to the phylogenetic placement of the SIVmon tips, the internal node at the base of this clade is red, indicating that the host of the ancestral virus was most likely a mona monkey. This contrasts with the host state of the samples isolated from talapoin monkeys (tips in green). These changes in the host state across the tree are what inform our estimates The SIV LANL compendium, slightly modified to reduce overrepresentation of HIV, was analyzed with GARD to identify 13 recombination breakpoints across the genome (dashed lines in B; numbering according to the accepted HXB2 reference genome-accession K03455, illustrated). Two of these breakpoints were omitted from further analyses because they created extremely short fragments (< 500 bases; gray dashes in B). For each of the 11 remaining breakpoints used in further analyses, we split the compendium alignment along these breakpoints and built a maximum likelihood tree, displayed in A. Each viral sequence is color-coded by host species, and its phylogenetic position is traced between trees. Heuristically, straight, horizontal colored lines indicate congruent topological positions between trees (likely not a recombinant sequence); criss-crossing colored lines indicate incongruent topological positions between trees (likely a recombinant sequence).
https://doi.org/10.1371/journal.ppat.1006466.g001 of the rate of transmission between host pairs. In total, the support for each possible transmission is derived from both A) whether the transmission is supported across the posterior distribution of phylogenies for a particular segment, and B) whether this is true for multiple genomic segments. Bayesian maximum clade credibility (MCC) trees are displayed for segments 2 (gag-A), 6 (int and vif-B), and 9 (env-C) of the main dataset (N = 423). Tips are color coded by known host species; internal nodes and branches are colored by inferred host species, with saturation indicating the confidence of these assignments. Monophyletic clades of viruses from the same lineage are collapsed, with the triangle width proportional to the number of represented sequences. An example of likely cross-species transmission is starred in each tree, where the host state at the internal node (red / mona monkeys) is incongruent with the descendent tips' known host state (green / talapoin monkeys), providing evidence for a transmission from mona monkeys to talapoin monkeys. Another example of cross-species transmission of a recombinant virus among African green monkeys is marked with a dagger. D-The genome map of SIVcpz, with breakpoints used for the discrete trait analysis, is color coded and labeled by the most likely ancestral host for each segment of the genome. Notably, the tree topologies are substantially different between segments, which emphasizes both the extent of recombination and the different evolutionary forces that have shaped the phylogenies of individual portions of the genome. In all segments' trees, we also see frequent changes in the host state between internal nodes (illustrated as changes in color going up the tree), suggestive of frequent ancient cross-species transmissions. On average, primate lentiviruses switch hosts once every 6.25 substitutions per site per lineage across the SIV phylogeny.
The cross-species transmission events inferred by the model are illustrated in Fig 3, with raw rates and Bayes factors (BF) in S4 Fig. As shown, the model correctly infers nearly all pairs of hosts with previously identified (to our knowledge) CST events [7,9,11,12,24,25], with the exception of the putative CST from sooty mangabeys to sabaeus monkeys reported by [20] (see Discussion). Importantly, we also identify 14 novel cross-species transmission events with strong statistical support (cutoff of BF ! 10.0). Each of these transmissions is clearly and robustly supported by the tree topologies (all 12 trees are illustrated in S5 Fig). To control for sampling effects, we repeated the analysis with a supplemental dataset built with fewer hosts, and more sequences per host (15 host species, subsampled to 16-40 sequences per viral lineage, N = 510), and see consistent results. As illustrated in S6-S8 Figs, we find qualitatively similar results. When directly comparing the average indicator values, host state transition rates, and BF values between analyses, the results from the main and supplemental datasets are strongly correlated, indicating robust quantitative agreement between the two analyses (S9 Fig). We also see similar agreement between replicates when we regenerate the main dataset via independent sampling draws (see online repository).
Taken together, these results represent a far more extensive pattern of CST among primate lentiviruses than previously described [4,5]; nearly every primate clade has at least one inbound, robustly supported viral transmission from another clade. We thus conclude that the majority of lentiviruses have arisen from a process of host switching, followed by a combination of intraclade host switches and host-virus coevolution.

Primate hosts may serve as either sources or sinks of SIVs
While most SIVs are the product of ancient recombination and host switching, the distribution of these host switches is not uniform; when we assess the network centrality of each virus we find a broad range (Figs 3, S7 and S11, as node size), indicating some hosts act as sources in the SIV transmission network and other hosts act as sinks. From this, we infer that some viruses have either had greater opportunity or have a greater ability to cross the species barrier than others.
In particular, the SIVs from the four closely related species of African green monkeys (SIVsab, SIVtan, SIVver, SIVgrv; collectively, SIVagm) appear to exchange viruses with other host species frequently (Fig 3, 12' o-clock). An example of SIVagm CST events can be seen in the tree topologies from gag, prot, reverse transcriptase (RT), and vif (Figs 2 and S5, segments 2, 3, 4, & 6). Here, SIVtan isolates reported by Ayouba et al. [26] (denoted with ✝ in Fig 2) clearly cluster with SIVsab, in a distant part of the phylogeny from the rest of the SIVagm viruses (including the majority of SIVtan isolates). For all other segments, however, the SIVagms cluster together. We thus concur with the conclusion of Ayouba et al. that these samples represent a recent spillover of SIVsab from sabaeus monkeys to tantalus monkeys, and the model appropriately identifies this transmission.
Contrastingly, previous studies of the lentiviral phylogeny have noted that SIVcol is typically the outgroup to other viral lineages, and have hypothesized that this may implicate SIVcol as the "original" primate lentivirus [27]. We find this hypothesis plausible, but the evidence remains inconclusive. For the majority of genomic segments, we also observe SIVcol as the clear outgroup (Figs 2 and S5). In contrast, for portions of gag/pol (segments 3 and 4) and some of the accessory genes (segment 7), we find that there is not a clear outgroup. For these segments, many other lineages of SIV are just as closely related to SIVcol as they are to each other. However, with the occasional exception of single heterologous taxa with poorly supported placement, SIVcol remains a monophyletic clade (N = 16), and does not intercalate within the genetic diversity of any other lineage in our dataset. Based on these collective tree topologies, our model does not identify strong evidence for any specific transmissions out of colobus monkeys, and identifies only a single, weakly supported inbound transmission (likely noise in the model caused by the fact that red-capped mangabeys are the marginally supported root host state; see below). This is consistent with previous findings that the colobinae-in a different genus than most of the Cercopithecus primates in our dataset-have a unique variant of the APOBEC3G gene, which is known to restrict lentiviral infection and speculated to be a barrier to cross-species transmission [27]. These observations generally support the idea of SIVcol as having maintained a specific relationship with its host over evolutionary time.
Additionally, while most host species carry only one lineage of SIV, mandrils and mustached monkeys carry 2 and 3 lineages of SIV, respectively [28][29][30][31]. In agreement with these previous studies, SIVmnd-1 and SIVmnd-2 do not always cluster together in the phylogeny; the same is true for SIVmus-1, SIVmus-2, and SIVmus-3, indicating that each of these viral lineages likely has a unique origin. This stands in stark contrast to baboons, which have only been infected by an SIV via a single documented spillover event [24].
Collectively, these examples demonstrate that the nature of the host-virus relationship is highly variable for primate lentiviruses, with some viruses switching hosts often while others putatively maintain strict host specificity. Likewise, while some hosts have acquired multiple SIV lineages, most are infected by only one SIV, or do not have an endemic SIV.
SIVcpz, the precursor to HIV-1, has a mosaic origin with unknown segments Unlike SIVcol, SIVcpz appears to be the product of multiple CSTs and recombination events. SIVcpz actually encompasses two viral lineages: SIVcpzPtt infects chimpanzees of the subspecies Pan troglodytes troglodytes, and SIVcpzPts infects chimpanzees of the subspecies Pan troglodytes schweinfurthii [32]. There are two additional subspecies of chimpanzees that have not been found harbor an SIV despite extensive surveys, suggesting that SIVcpzPtt was acquired after chimpanzee subspeciation [25]. Both this previous work and our own results support the hypothesis that SIVcpz was later transmitted from one chimpanzee subspecies to the other, and SIVcpzPtt is the only SIVcpz lineage that has crossed into humans. Given the shared ancestry of the two lineages of SIVcpz, we use "SIVcpz" to refer specifically to SIVcpzPtt.
Based on the lentiviral sequences available in 2003, Bailes et al [12] suggested that the SIVcpz genome is a recombinant of just two parental lineages. SIVrcm (which infects redcapped mangabeys) was identified as the 5' donor, and an SIV from the SIVmon/-mus/-gsn clade (which infect primates in the Cercopithecus genus) was identified as the 3' donor. Since the time of this previous investigation many new lentiviruses have been discovered and sequenced. In incorporating these new data, we find clear evidence that the previous twodonor hypothesis may be incomplete.
The tree topologies from env in the 3' end of the genome (segments 8-11) support the previous hypothesis [12] that this region came from a virus in the SIVmon/-mus/-gsn clade. These viruses form a clear sister clade to SIVcpz with high posterior support (Fig 2C and 2D). We find strong evidence for transmissions from mona monkeys (SIVmon) to mustached monkeys (SIVmus), and from mustached monkeys to greater spot-nosed monkeys (SIVgsn) (see Discussion of potential coevolution below). We also find more evidence in support of a transmission from mona monkeys to chimpanzees than from the other two potential donors, but more sampling is required to firmly resolve which of these viruses was the original donor of the 3' end of SIVcpz.
We find phylogenetic evidence to support the previous hypothesis [12,13] that the integrase and vif genes of SIVcpz (segments 4-6) originated from SIVrcm; however, we find equally strong evidence to support the competing hypothesis that pol came from SIVmnd-2, which infects mandrils (Fig 2B and 2D). In these portions of the genome, SIVmnd-2 and SIVrcm together form a clear sister clade to SIVcpz. The vpr gene, in segment 7, is also closely related to both SIVrcm and SIVmnd-2, but this sister clade also contains SIVsmm from sooty mangabeys. Notably, we infer a transmission from red-capped mangabeys to mandrils, but we cannot determine whether this portion of the SIVcpz genome was acquired directly from SIVrcm or from SIVmnd-2.
Interestingly, we do not find evidence to support either SIVrcm/mnd-2 or SIVmon/mus/gsn as the donor for the 5' most end of the genome (segments 1-5), including the 5' LTR, gag, and RT genes. This is also true for the 3' LTR (segment 12). SIVcpz lacks a clear sister clade or ancestor in this region, and SIVrcm groups in a distant clade; we therefore find no evidence to suggest that an ancestor of an extant SIVrcm was the parental lineage of SIVcpz in the 5' most end of the viral genome as previously believed (Fig 2A and 2D). This may support the possibility of a third parental lineage, or a number of other plausible scenarios (discussed below).

Discussion
Limitations and strengths of the model Additional sampling is required to fully resolve the history of CST among lentiviruses. In addition to the 14 strongly supported novel transmissions (BF ! 10) described above, we also find substantial evidence for an additional 8 possible novel transitions, but with lower support (BF ! 3) (Figs 3, S4). These transmissions are more difficult to assess, because many of them are inferred on the basis of just a few "outlier" tips of the tree that group apart from the majority of viral samples from the same lineage. In each case, the tips' phylogenetic position is strongly supported, and the primary literature associated with the collection of each of these "outlier" samples clearly specifies the host metadata. However, due to the limited number of lentiviral sequences available for some hosts, we are unable to control for sampling effects for some of these lower-certainty transmissions. We report them here because it is unclear whether these outliers are the result of unidentified separate endemic lineages, onetime spillovers from other hosts, or species misidentification during sample collection. It is also important to note that while some of these less-supported transmissions are potentially sampling artifacts, many of them may be real, and may be less supported simply because they lack the requisite available data for some genome segments.
Ultimately, far more extensive sampling-specifically, obtaining more full-length sequences from undersampled lineages-of primate lentiviruses is required in order to resolve these instances. We included only sequences at least 500 bases long; each taxon may contribute more informative sites to some segments than to others. When splitting the master alignment along breakpoints, we removed from each segment any taxon that had no informative bases. However, for each segment, there were between 0 and 13 (mean: 3.6) taxa that had some informative sites, but were very short (<100 informative sites). Statistically, these short taxa contribute little information, and are placed in the topology for each segment with high uncertainty. This phylogenetic uncertainty is then propagated forward to the discrete traits model, meaning that these short taxa should not statistically influence our results in any meaningful way. Notably, though, their removal does result in extensive technical challenges (this "pruned" dataset results in poor mixing and rather divergent results, seen in S10-S12 Figs). Given the high congruence between results from independent sampling replicates of the main dataset and from an alternative sampling scheme, we believe this to be a technical issue, rather than reflecting true differences. However, this issue does further emphasize the importance of additional sampling in fully resolving the natural history of SIVs.
Most lentiviruses were originally acquired by CST and have since coevolved with their hosts. Some of these "noisier" transmission inferences, particularly within the same primate clade, may be the result of coevolution, ie. lineage tracking of viral lineages alongside host speciation. Within the model, viral jumps into the common ancestor of two extant primate species appear as a jump into one of the extant species, with a secondary jump between the two descendants. For example, the model infers a jump from mona monkeys into mustached monkeys, with a secondary jump from mustached monkeys into their sister species, red-tailed guenons (Fig 3). Comparing the virus and host phylogenies, we observe that this host tree bifurcation between mustached monkeys and red-tailed guenons is mirrored in the virus tree bifurcation between SIVmus and SIVasc for most segments of the viral genome (Figs 2 and 3). This heuristically suggests that the true natural history may be an ancient viral transmission from mona monkeys into the common ancestor of mustached monkeys and red-tailed guenons, followed by host/virus coevolution during primate speciation to yield SIVmus and SIVasc.
The possibility of virus/host coevolution means that while we also observe extensive host switching between primate clades, many of the observed jumps within a primate clade may be the result of host-virus coevolution. However, we also note that the species barrier is likely lower between closely related primates, making it challenging to rigorously disentangle coevolution vs. true host switches within a primate clade [33].
The model propagates and accounts for residual uncertainty from the recombination analyses. The deep phylogenies and extensive sequence divergence between SIV lineages makes any assessment of recombination imperfect. Our analysis aims to 1) place a lower bound on the number of interlineage recombination events that must have occurred to explain the observed extant viruses, and 2) use this understanding to construct a model of CST among these viruses. As the most well developed package currently available for topology-based recombination analysis, GARD was an appropriate choice to identify in broad strokes the extent and nature of recombination among SIVs. Some (though not all) of the remaining uncertainty as to the exact location of breakpoints is represented within the posterior distribution of trees for each segment, and is thus propagated and accounted for in the discrete traits model (inferences of CST). Most other studies of SIV evolutionary history simply split the phylogeny along gene boundaries or ignore recombination (e.g., [5,12,33]). Thus, while there is still some remaining uncertainty in our recombination analyses, these results still represent a major step forward in attempting to systematically assess recombination among all extant SIV lineages and to incorporate it into the phylogenetic reconstruction.
Cross-species transmission is driven by exposure and constrained by host genetic distance Paleovirology, biogeography, and statistical models of lentiviral evolution estimate that primate lentiviruses share a common ancestor approximately 5-10 million years ago [27,[34][35][36]. This, along with the putative viral coevolution during primate speciation, suggests that many of these transmissions were ancient, and have been acted on by selection for millions of years. Thus, given that the observed transmissions almost exclusively represent evolutionarily successful host switches, it is remarkable that lentiviruses have been able to repeatedly adapt to so many new host species. In the context of this vast evolutionary timescale, however, we conclude that while lentiviruses have a far more extensive history of host switching than previously understood, these events remain relatively rare overall.
As noted above, our results illustrate that some SIVs cross the species barrier more readily than others, and some primate host species become infected with new viral lineages more commonly than others. This is likely governed by both ecological and biological factors. Ecologically, frequency and form of exposure are likely key determinants of transmission [3], but these relationships can be difficult to describe statistically. For example, many primates are chronically exposed to many exogenous lentiviruses through predation [5,15,37]. Using log body mass ratios [38] as a proxy for predation, we do not see a statistically significant association between body mass ratio and non-zero transmission rate (Fig 4A, blue; p = 0.678, coef. 95% CI [-0.311, 0.202]). We believe the lack of signal is likely due to the imperfect proxy, although it is also possible that predator-prey relationships do not strongly structure the CST network. It is also likely that geographic overlap and habitat similarity are ecological determinants of SIV CST, but modern primate habitats are likely very different since the time that these transmissions actually occurred.
Biologically, increasing host genetic distance has a clear negative association with the probability of cross-species transmission (Fig 4B, blue, p<0.001, coef. 95% CI [-7.633, -2.213]). Importantly, as already discussed, the strength of this association may be inflated by instances of lineage tracking (virus/host cospeciation). However, it is well established that increasing host genetic distance is associated with a higher species barrier [33]. As previously documented in the literature we expect this is due to several factors, such as the divergence of host restriction factor genes, which are key components of the innate immune system (reviewed in [39]) and differences in host cell receptor phenotypes (e.g., [40][41][42]). Functional assays of these host phenotypes against panels of SIVs, while outside the scope of this study, will be important for further identifying the molecular bases of the species barriers that have led to the transmission patterns identified here.

Origins of HIV-1 and HIV-2
Epidemiological factors were key to the early spread of HIV. Understanding the underlying dynamics of lentiviral CST provides important ecological context to the transmissions that generated the HIV pandemic. As discussed above, our results support a view of lentiviral cross-species transmission as a rare event. Notably, only two lentiviruses have crossed the high species barrier from Old World monkeys into hominids: SIVsmm and the recombinant SIVcpz. Both HIV-1 and HIV-2 have arisen in human populations in the last century [5,43]. While it is possible that this has occurred by chance, even without increased primate exposure or other risk factors, we nevertheless find it striking that humans would acquire two exogenous viruses within such a short evolutionary timespan.

Fig 4. Logistic regressions of body mass ratios (a proxy for predation) and host genetic distance on the probability of CST.
For each pair of host species, we (A) calculated the log ratio of their average body masses and (B) found the patristic genetic distance between them (from a maximumlikelihood tree of mtDNA). To investigate the association of these predictors with cross-species transmission, we treated transmission as a binary variable: 0 if the Bayes factor for the transmission (as inferred by the discrete traits model) was < 3.0, and 1 for a Bayes factor ! 3.0. Each plot shows raw predictor data in gray; the quintiles of the predictor data in green; and the logistic regression and 95% CI in blue.
https://doi.org/10.1371/journal.ppat.1006466.g004 SIV diversity generated by recombination, cross-species transmission Examining this phenomenon more closely, the history of HIV-2 is enlightening. HIV-2 has been acquired through at least 8 independent spillover events from sooty mangabeys [5]. Notably, 6 of these transmissions have resulted in only a single observed infection (spillovers) [7,8,44]; only 2 of these events have established sustained transmission chains and successfully switched hosts to become endemic human pathogens [45][46][47]. This pattern, as well as serology-based reports of other limited spillovers of SIVs into humans [28,48], suggest that there have been many isolated introductions of lentiviruses into humans over the past 200,000 years. However, these other viral exposures did not result in new endemic human pathogens either because of species-specific immune barriers, non-conducive epidemiological conditions, or a combination thereof. The rapid and repeated emergence of HIV-1 and HIV-2 is on a timescale more congruent with changes in epidemiological conditions than mammalian evolution, perhaps emphasizing the importance of the concurrent changes in human population structure and urbanization in facilitating the early spread of the epidemic [43]. Significantly, though, this also highlights the importance of careful public health surveillance and interventions to prevent future epidemics of zoonotic viruses.
The exact origins of SIVcpz may not be identifiable. The ancestry of SIVcpz appears to be tripartite: the unknown ancestry of the 5' end; the putatively SIVrcm or SIVmnd-2 derived int and vif genes; and the putatively SIVgsn/mon/mus derived 3' end of the genome. For any of these three portions of the genome, there are multiple evolutionary histories supported by available sequence data. However, it is possible that further sampling of lentivirus lineages (both known and currently undiscovered) will be able to definitively resolve the ancestry of SIVcpz. Alternatively, it may be that the ancestral virus(es) that gave rise to any of these three portions is extinct [49]. In the case of the last two portions of the SIVcpz genome, it may be that the common ancestor of these putative genetic donors (SIVrcm/mnd-2 and SIVmon/mus/ gsn, respectively) was the true source. However, it is also a distinct possibility that SIVcpz has sufficiently diverged since its acquisition by chimpanzees such that its history is obscured.
Evolutionary time obscures the identity of the "original" primate lentivirus. Among primates, our results clearly illustrate that the vast majority of lentiviral lineages were originally acquired by cross-species transmission. It is intriguing to speculate as to which virus was the "original" source of all of these lineages. Because of its consistent position as the outgroup of primate lentiviral trees, it has been hypothesized [27] that SIVcol was this original lentivirus among primates. While SIVcol is certainly the most evolutionary isolated extant lentivirus that has been sampled to date, this does not definitively place it as the ancestral lentivirus. Alternative scenarios (also noted by [27,49]) include an extinct original lentiviral lineage (and/or primate host species) or an unsampled ancestral lentivirus. It is also plausible that another known extant lentivirus was the "original" lineage, but has diverged and/or recombined to such an extant that its origins are obscured.

Conclusions
Here, we have shown that lentiviruses have a far more extensive history of host switching than previously described. Our findings also demonstrate that the propensity of each lentiviral lineage to switch between distant hosts, or to spillover among related hosts, is highly variable. In examining specific lineages, our findings are consistent with the prevalent hypothesis that SIVcol has evolved in isolation from other SIVs. Contrastingly, we have also demonstrated that the mosaic origins of SIVcpz are far more complex than previously recognized; the currently available sequence data is unable to resolve the ancestry of nearly half of the SIVcpz genome. Together, our analyses move closer to a full understanding of the pattern of cross-species transmission among primate lentiviruses, but additional efforts to obtain high quality viral genome sequences from under sampled lineages will be necessary to fully resolve the natural history of these viruses.

Methods
All datasets, config files, documentation, and scripts used in this analysis or to generate figures are available publicly at <https://github.com/blab/siv-cst>.

Datasets & alignments
Lentiviral genomes are translated in multiple reading frames; we therefore utilized nucleotide sequence data for all analyses.
Recombination analysis dataset. For all recombination analyses (GARD, R 2 , rSPR), we used the 2015 Los Alamos National Lab (LANL) HIV/SIV Compendium [18]. The compendium is a carefully curated alignment of high-quality, representative sequences from each known SIV lineage and each group of HIV-1 and HIV-2. We reduced the overrepresentation of HIV in this dataset, but maintained at least one high quality sample from each group of HIV. In total, this dataset contains 64 sequences from 24 hosts (1-10 sequences per host). Maximum likelihood trees for each segment-used for the rSPR analysis and displayed in Fig  1-were built with RAxML v.8.2.9 [50] with the rapid bootstrapping algorithm under a GTR model with 25 discrete bins of site to site rate variation and were rooted to SIVcol for display.
Main & supplemental datasets. Primate lentiviral sequences were downloaded from the comprehensive LANL HIV sequence database [18]. Sequences from lineages known to be the result of artificial cross-species transmissions (SIVmac, -stm, -mne, and -wcm) were excluded. We also excluded any sequences shorter than 500 bases or that were flagged as problematic by standard LANL criteria. We grouped host subspecies together except for cases where there is a known specific relationship between host subspecies and virus lineage (chimpanzees and African green monkeys). To construct datasets with a more equitable distribution of sequences per host, we preferentially subsampled sequences from the LANL Compendium, followed by samples isolated from Africa (more likely to be primary sequences), and finally supplementing with samples isolated elsewhere (excluding experimentally generated sequences). For humans, mandrils and mustached monkeys, which are host to 2, 2 and 3 distinct viral lineages, respectively, we allowed a few additional sequences (if available) to represent the full breadth of documented lentiviral diversity. The "main" dataset consists of 24 host species, with 5-31 sequences per host (total N = 422). As an alternative dataset to control for sampling bias and data availability, we also constructed a "supplemental" dataset with just 15 hosts but with more viral sequences per host (16-40 sequences per lineage, N = 510).
Alignments. Alignments were done with the L-INSI algorithm implemented in mafft v7.294b [51]; the Compendium alignment was held fixed, with other sequences aligned to this template. Insertions relative to the fixed compendium were discarded. This alignment was then split along the breakpoints identified by GARD to yield the segment-specific alignments. Sequences that contained no bases for a given segment were removed; the "pruned" dataset is identical to the "main" dataset, but here we also removed sequences with fewer than 100 ungapped bases for a given segment.

Recombination analyses
Topology-based analysis. Each portion of the genome was analyzed with the Genetic Algorithm for Recombination Detection (GARD), implemented in HyPhy v.2.2.0 [17], with a nucleotide model selected by HyPhy's NucModelCompare package (#012234) and general discrete distribution (3 bins) of site variation.
Computational intensity was eased by analyzing the recombination dataset in 3kb long portions, with 1kb overlaps on either end (e.g., bases 1:3000, 2000:5000, 4000:7000, etc.). To control for sites' proximity to the ends of the genomic portion being analyzed, this was repeated with the windows offset (e.g., bases 1:2500, 1500:4500, 3500:6500, etc.). In total, this resulted in every site of the alignment being analyzed at least two-fold, with at least one of these replicates providing 500-1500 bases of genomic context on either side (other than at the very ends of the total alignment). Disagreement between window-offset replicates for a given breakpoint were minimal and almost always agreed within a few bases, with two exceptions: offset replicates for the breakpoint between segments 7 and 8 disagreed by 529 bases, and offset replicates for the breakpoint between 8 and 9 disagreed by 263 bases. In these instances, we used the average site placement.
Sitewise linkage analysis. Biallelic sites were identified across the genome (ignoring gap characters and polymorphisms present at <5%). These biallelic sites were compared pairwise to generate an observed and expected distribution of haplotypes (combinations of alleles between the two sites), and assessed with the statistic R 2 = χ 2 /(d × N), where χ 2 is chi-square, d is the degrees of freedom, and N is the number of haplotypes for this pair of sites [52]. This statistic follows the canonical interpretation of R 2 , i.e., if the allele at site 1 is known, how well does it predict the allele at site 2 (0 indicating no linkage between sites, and 1 indicating perfect linkage).
Segment linkage analysis. Segment linkage was assessed by comparing the similarity of tree topologies between segments. This was done with the pairwise method in the Rooted Subtree-Prune-and-Regraft (rSPR) package [53], which measures the number of steps required to transform one topology into another. Segment pairs with similar topologies have lower scores than segments with less similar topologies.

Phylogenies & discrete trait analysis
Empirical tree distributions. For each segment alignment, a posterior distribution of trees was generated with BEAST v. 1.8.2 under a general time reversible substitution model with gamma distributed rate variation and a strict molecular clock [54]. We used a Yule birthdeath speciation model tree prior [55]; all other priors were defaults. Trees were estimated using a Markov chain Monte Carlo (MCMC), and convergence was determined by visually inspecting the trace in Tracer v. 1.6.0 [56]. MCMC chain parameters and ESS values were as follows.
"Resampled" dataset: 65 million steps, sampled every 15,000 states, 10% burnin removed. Posterior distribution of~3900 trees per segment. All ESS values except for the AG substitution rate in segment 2 (ESS = 190) were well above 200.
Discrete trait analysis. These tree distributions were used to estimate the transmission network. As in Faria et al. [21], we model hosts as states of a discrete trait, and model crossspecies transmission as a stochastic diffusion process among n host states. We use a nonreversible state transition matrix with n × (n-1) individual transition parameters [57]. We also utilize standard methodology in using a Bayesian stochastic search variable selection process to encourage a sparse network, implemented as in [22]. An exponential distribution with mean = 1.0 was used as a prior on each of the pairwise rate parameters (552 in total for the main and pruned datasets with 24 hosts; 210 parameters in total for the supplemental dataset with 15 hosts). The prior placed on the sum of the binary indicator variables corresponding to each pairwise rate parameter (i.e., the number of transitions that are non-zero) was an exponential distribution with mean = 20-23. Convergence was again assessed by visually inspecting the log files in Tracer. MCMC chain parameters and ESS values were as follows.
"Pruned" dataset: decreased phylogenetic noise in the empirical tree distributions for this dataset led to narrow tree distributions and poor mixing (but clear convergence). We thus ran 40 parallel chains of 25-35 million steps each, sampled every 20,000 states (trees sampled every 50,000 states). We discarded 10 million steps of each chain as burnin and concatenated the remaining states.
For the main and supplemental datasets, more than 97% of the rate, indicator, and actual-Rate (stepwise rate×indicator values) parameters, and all other parameters, had an ESS well over 200 (most > 1000). The pruned dataset had a posterior ESS of 677, but the component tree likelihood ESS values ranged from 57-287; 91% of the rate, indicator, and actualRate parameters had an average ESS value > 200.
Statistical support for each transmission is summarized as a Bayes factor (BF), calculated by comparing the posterior and prior odds that a given rate is non-zero (i.e., that there has been any transmission between a given pair of hosts) [22]. The ancestral tree likelihoods of each of the 12 tree distributions contribute equally to the inference of a shared transmission rate matrix. However, not every lineage has recombined along every breakpoint, which means that the tree likelihoods from each segment are not fully statistically independent. Thus, we report conservative estimates of BF by dividing all BF values by 12. Values for the asymmetric transition rates between hosts, as estimated by the CTMC, were calculated as rate × indicator (element-wise for each state logged). We report the average posterior values above. Bayes factors represent a ratio of the posterior odds / prior odds that a given actual rate is non-zero. Because each of the 12 segments contributes to the likelihood, but they have not evolved independently, we divide all Bayes factors by 12 and report the adjusted values above (and throughout the text). Values for the asymmetric transition rates between hosts, as estimated by the CTMC, were calculated as rate × indicator (element-wise for each state logged). We report the average posterior values above. Bayes factors represent a ratio of the posterior odds / prior odds that a given actual rate is non-zero. Because each of the 12 segments contributes to the likelihood, but they have not evolved independently, we divide all Bayes factors by 12 and report the adjusted values above (and throughout the text). Values for the asymmetric transition rates between hosts, as estimated by the CTMC, were calculated as rate × indicator (element-wise for each state logged). We report the average posterior values above. Bayes factors represent a ratio of the posterior odds / prior odds that a given actual rate is non-zero. Because each of the 12 segments contributes to the likelihood, but they have not evolved independently, we divide all Bayes factors by 12 and report the adjusted values above (and throughout the text).