Endogenous Viral Elements in Animal Genomes

Integration into the nuclear genome of germ line cells can lead to vertical inheritance of retroviral genes as host alleles. For other viruses, germ line integration has only rarely been documented. Nonetheless, we identified endogenous viral elements (EVEs) derived from ten non-retroviral families by systematic in silico screening of animal genomes, including the first endogenous representatives of double-stranded RNA, reverse-transcribing DNA, and segmented RNA viruses, and the first endogenous DNA viruses in mammalian genomes. Phylogenetic and genomic analysis of EVEs across multiple host species revealed novel information about the origin and evolution of diverse virus groups. Furthermore, several of the elements identified here encode intact open reading frames or are expressed as mRNA. For one element in the primate lineage, we provide statistically robust evidence for exaptation. Our findings establish that genetic material derived from all known viral genome types and replication strategies can enter the animal germ line, greatly broadening the scope of paleovirological studies and indicating a more significant evolutionary role for gene flow from virus to animal genomes than has previously been recognized.


Introduction
Viral infection of germ line cells (i.e. gametes, or cells of the early embryo) can lead to viral genes or genomes becoming integrated into chromosomes and inherited as host alleles [1,2]. These insertions, which we refer to here as endogenous viral elements (EVEs), are usually eliminated from the host gene pool within a small number of generations. However, they can also increase in frequency, and some eventually reach fixation [3][4][5][6][7][8][9][10][11].
In animal genomes, the majority of EVEs are derived from reverse transcribing RNA (rtRNA) viruses (i.e. retroviruses) [5,12,13]. Retroviruses are the only animal viruses that integrate into the genome of the host cell as an obligate step in their replication strategy, and are thus predisposed to enter the host germ line (Figure 1). EVEs derived from viruses that use other genome replication strategies also occur, but are much less common [6,7,9,11,14,15]. Genomic integration of non-retroviral viruses may be mediated by non-homologous recombination with chromosomal DNA [16][17][18] or by interactions with retroelements in the host cell [11,[19][20][21][22] (Figure 1).
EVEs reveal complex evolutionary relationships between viruses and their hosts. For example, endogenous retroviruses have shaped vertebrate genome evolution, not only by acting as genetic parasites [23,24], but also by introducing useful genetic novelty. Indeed, the role of exapted retroviral genes (i.e. integrated retroviral genes that have adapted to serve a function in the host genome) in mammalian reproduction [25,26] identifies EVEs as a key factor in the evolution of placental mammals from egg-laying ancestors. Similarly, in parasitoid wasps, genes derived from ancestral nudiviruses have been exapted to facilitate a parasitic lifestyle [9]. These remarkable examples demonstrate an important role for gene flow from viruses to hosts in animal evolution.
EVEs also constitute an invaluable resource for reconstructing the long-term history of virus and host evolution [27,28]. Viruses exhibit the potential for extremely high rates of nucleotide substitution, host switching, and lineage extinction, and this sets limitations on what can be reliably inferred from observations of contemporary isolates [29,30]. EVE sequences effectively represent the 'molecular fossils' of ancient viral genomes, preserving information about ancient virus and host interactions that would otherwise be difficult, if not impossible, to infer. For example, EVEs are subject to host rates of evolution and can thus be dated relatively reliably with molecular clock-based approaches, in which genetic divergence correlates linearly with time [31]. In contrast, structural constraints in exogenous viruses may lead to the decoupling of short and long-term rates of viral evolution, rendering molecular clock assumptions unusable over longer timescales [30,[32][33][34]. Furthermore, the identification of orthologous EVE insertions allows the incorporation of independent age estimates based on host species divergences (see Figure 1) [35].
Despite the large quantity of published genome sequence data, the diversity of non-retroviral viruses in animal genomes has not been systematically explored. In this report, we use an in silico approach to screen the genomes of mammals, birds and insect vector species for endogenous sequences derived from nonretroviral mammalian viruses. We identify sequences derived from a very broad range of viruses, revealing an extensive history of non-retroviral genome invasion ranging back to at least the late Mesozoic Era (,93 million years ago). We demonstrate that these sequences can be highly informative; (i) revealing novel virus diversity; (ii) providing a timescale for virus evolution; (iii) indicating the likely host range of virus groups, and; (iv) identifying rare instances of horizontal transmission. Furthermore, using a novel approach, we show that at least some of the EVE sequences identified here are likely to have been exapted during their evolution. The implications of these findings are discussed.

Genome screening
An algorithm for in silico screening of genomes for endogenous non-retroviral insertions was developed. We selected all nonretroviral virus genera that infect mammals, and constructed a library of representative peptide sequences (restricted to viruses with typical genome sizes of ,100 Kilobases (Kb)) ( Table S2). The tBLASTn program was used to screen low coverage and complete genome assemblies for sequences exhibiting similarity to viral peptides in this library. We screened the genomes of likely reservoirs (birds, n = 2) and vector species (mosquitoes, n = 3; ticks, n = 1) as well as all available mammal genomes (n = 44) (Table S1). Sequences that matched viral peptides with e-values ,0.001 were extracted (along with flanking sequences) and putative protein sequences were inferred through a combination of automated and manual alignment. These sequences were assigned to taxonomic groups (family, genus) based on the most closely related exogenous viral sequences in searches of PFAM and Genbank databases (Tables S3, S4, S5, S6, S7). For EVEs that were found to encode uninterrupted open reading frames (ORFs), putative protein sequences were used with the tBLASTn program to search expressed sequence tag (EST) databases for the corresponding Figure 1. Viral replication strategies, endogenous viral elements, and the genomic fossil record. Animal viruses exhibit a range of genome types and replication strategies. While all viruses must produce mRNA in order to express proteins, steps between entry into the cell and the expression of mRNA vary greatly. Examples of the known animal virus replication strategies are shown to the left of the figure, with the representative families listed for each case. Arrows indicate steps in replication. Red lines indicate pathways that lead to viral genetic material becoming integrated into the nuclear genome of the host cell. Retroviruses are unique amongst animal viruses in that integration occurs as an obligate step in replication. For all other animal viruses integration occurs anomalously, through interaction with cellular retroelements such as LINEs, or via non-homologous recombination with genomic DNA. If integration occurs in a germ line cell that goes on to develop into a viable host organism, an EVE is formed. Green lines show the evolution of an EVE in its host lineage. In the example given, the EVE reaches genetic fixation at the point indicated, and is inherited by all descendant hosts thereafter. Assuming that insertion occurs randomly, the presence of related EVEs at the same locus in both descendant species A and B indicates that insertion occurred prior to their divergence, allowing a minimum age for the insertion to be inferred from the estimated timescale of their evolution. Conversely, the presence of an empty insertion site in species C provides a maximum age for the insertion. Abbreviations: dsDNA (double stranded DNA); ssDNA (single stranded DNA, dsRNA (double stranded RNA); RNA-ve (negative sense, single stranded RNA); RNA-ve (negative sense, single stranded RNA); RNA+ve (positive sense, single stranded RNA). doi:10.1371/journal.pgen.1001191.g001

Author Summary
The presence of retrovirus sequences in animal genomes has been recognized since the 1970s, but is readily explained by the fact that these viruses integrate into chromosomal DNA as part of their normal replication cycle. Unexpectedly, however, we identified a large and diverse population of sequences in animal genomes that are derived from non-retroviral viruses. Analysis of these sequences-which represent all known virus genome types and replication strategies-reveals new information about the evolutionary history of viruses, in many cases providing the first and only direct evidence for their ancient origins. Additionally, we provide evidence that the functionality of one of these sequences has been maintained in the host genome over many millions of years, raising the possibility that captured viral sequences may have played a larger than expected role in host evolution.
mRNA. For all EVEs disclosing similarity to contemporary virus isolates, putative EVE protein sequences were aligned with representative viral protein sequences, and maximum likelihood phylogenies were constructed.

EVEs related to viruses with RNA genomes
We identified numerous, highly significant matches (i.e. e-values ,1610 29 ) to RNA viruses in the genomes of mammals and insect vectors (Table 1, Tables S3, S4, S5). EVEs related to a total of seven families were identified including double stranded RNA (dsRNA) viruses (Reoviridae) and positive sense RNA (RNA+ve) viruses (Flaviviridae), as well as both segmented (Orthomyxoviridae, Bunyaviridae) and non-segmented (Borna-, Filoand Rhabdoviridae) families of negative sense RNA (RNA-ve) viruses. Consistent with an integration process involving viral mRNA (rather than genomic RNA), all EVEs derived from RNA viruses had genetic structures that spanned a single viral transcript (or fragments derived from single transcripts). EVEs derived from different genes never occurred as contiguous sequences, and consequently we could not determine whether EVEs derived from distinct genes of a given virus family originated from the same or distinct virus lineages/infections.
In mammals, matches to RNA virus proteins that spanned complete genes were typically flanked by target site duplications (TSDs) and 39 poly-A tails, consistent with LINE-mediated retrotransposition of viral mRNAs [36]. In insects, similar features were not apparent for any EVE insertion, even when the boundaries of host and viral sequences were clearly identifiable ( Figure S1). Notably, putative 39 poly-A tails could be identified in the expected position for some mammal genome sequences that matched only weakly to RNA virus peptides, suggesting the presence of EVEs at the limit of detection to our search strategy.
EVEs related to RNA-ve viruses. Numerous EVE sequences disclosing similarity to proteins derived from nonsegmented RNA-ve virus families in the order Mononegavirales (Borna, Filo and Rhabdoviridae) were identified. Matches to genes encoding the relatively conserved nucleoprotein (NP) and Lpolymerase proteins predominated, but matches to more rapidly evolving glycoproteins were also identified ( Figure 2a). A subset of RNA-virus related EVEs identified in the wallaby genome (Table 1) exhibited significant similarity to a mononegaviral RNA polymerase but were only distantly related to any known group and were not analyzed further.
Sequences disclosing similarity to bornavirus proteins were identified in 16 mammalian genomes. In addition to numerous EVEs derived from bornavirus NP genes (some of which have previously been reported as endogenous Borna-like N (EBLN) elements [11]), we identified EVEs derived from bornavirus Lpolymerase, matrix (M) and glycoprotein (GP) genes ( Figure 2a). Several sets of orthologous insertions were identified; (i) four distinct NP insertions in strepsirrhine primate genomes (previously been reported as EBLN 1-4 [11]); (ii) GP in strepsirrhine primates; (ii) L-polymerase in rats and mice; (iii) NP in three Afrotherian species (African elephant, cape hyrax, lesser hedgehog tenrec (Echinops telfairi)). Bornavirus nucleoprotein trees revealed an expansion of EBLNs in the New World branch of strepsirrhine primates, represented by the common marmoset (Callithrix jacchus) (Figure 2b).
Sequences matching filovirus NP genes were identified in the little brown bat (Myotis lucifugus), Tammar wallaby (Macropus eugenii), and gray short-tailed opossum (Monodelphis domestica) genomes. The majority of these matches comprised fragments of genes, although two full-length NP gene EVEs (displaying poly-A tails and TSDs) were identified in bats ( Figure 2a). More divergent fragments of the NP gene were identified in the kangaroo rat (Dipodymys ordii), guinea pig (Cavia porcellus) and common shrew (Sorex araneus) genomes. Additionally, fragments of the L-polymerase and VP35 gene were identified in the genomes of the opossum and Philippine tarsier (Tarsius syrichta) respectively. In phylogenies, EVEs derived from filovirus NP genes grouped into two wellsupported clades (Figure 2c), the largest of which included exogenous filoviruses and EVEs derived from marsupials (wallaby and opossum). EVEs in the smaller clade were more distantly related to extant filoviruses, and were derived from the little brown bat, guinea pig, shrew and kangaroo rat genomes. An opossum EVE derived from L-polymerase grouped relatively closely with Marburgviruses. Conflicting phylogenetic trees for oppossum Lpolymerase and NP-derived EVEs strongly indicated they are derived from distinct ancestral viruses. EVEs related to rhabdoviruses were identified in the blacklegged tick genome (Ixodes scapularis), and in the genomes of both Aedes and Culex mosquitoes. Among these were insertions that encoded intact NP and GP ORFs (Figure 2a). Phylogenies constructed using NP tentatively grouped rhabdovirus EVEs derived from mosquitoes in a clade with lyssaviruses and Drosophila sigma virus ( Figure 2d). However phylogenetic support for this clade was very weak, with only 20% bootstrap support for the monophyly of the clade, although support for the grouping of EVEs from Aedes and Culex was high (96%). A robust clade (100% bootstrap support) placed four Ixodes EVEs into a single group, suggesting they are likely derived from the same exogenous virus lineage, but their placement relative to other Rhabdoviruses was ambiguous, as they formed a clade with a number of distinct Rhabdoviruses with minimal bootstrap support (8%). Phylogenies constructed using L-polymerase sequences weakly grouped Ixodes and Aedes insertions with Lyssaviruses and Moussa virus, but not Drosophila sigma virus. Weak support for basal relationships was obtained with both trees, making it difficult to confidently place thenovel EVEs with respect to the known rhabdovirus diversity.
Matches to RNA-ve viruses with segmented genomes were identified in the genomes of insect vectors (Figure 3a). In the I. scapularis genome, we identified EVEs related to viruses isolated from ticks and birds basal to the proposed genus Quarjavirus (including Quaranfil and Johnston Atoll viruses) [37] in the family Orthomyxoviridae (Figure 3c), EVEs distantly related to the Bunyaviridae (Phlebovirus and Nairovirus genera) were identified in the I. scapularis genome. Nairovirus-derived EVEs were distantly related to Hazara virus (Figure 3d), indicating they represent a distinct lineage within this tick-vectored genus. Phlebovirus EVEs formed a robustly supported cluster in phylogenies with exogenous viruses (Figure 3e), closest to Uukuniemi and Catch-me cave viruses (vectored by ticks and mosquitoes respectively), suggesting they are derived from the same exogenous lineage.
EVEs related to dsRNA viruses. In the A. aegyptii genome we identified an EVE that was very closely related (,98% nucleotide sequence identity) to segment 5 of the Liaoning virus genome ( Figure S1, Figure 3a and 3b). Liaoning is a dsRNA virus (family Reoviridae, genus Seadornavirus) that was recently isolated from Aedes dorsalis mosquitoes [38]. The Liaoning EVE in A. aegyptii had a large inframe deletion, but encoded an otherwise intact ORF. This is the first EVE derived from a dsRNA virus to be described. As with other RNA virus EVEs in insect genomes, the mechanism of genomic integration was unclear. The intact ORFs and high level of identity to a circulating virus raise the possibility this EVE formed recently and is not fixed in the host population.
EVEs related to RNA+ve viruses. The genome of the Aedes mosquito contains several sequences exhibiting similarity to the viruses of the RNA+ve family Flaviviridae. Endogenous flaviviruses have previously been reported in the genomes of A. aegyptii and A. albopictus mosquitoes [6,39], but complete putative genomic structures have not been determined. In particular, a large fragment spanning the flaviviral NS1, NS2A, NS2B, NS3 and NS4A genes has been described in A. albopictus, and a range of smaller fragments at the 39 end of the flaviviral genome, mostly of the NS5 gene, have been described in both albopictus and aegyptii species. We have identified fragments that together span almost the entire flavivirus genome in A. aegyptii (based on alignment to Kamiti river virus (Figure 4a)), including a single fragment that spans the equivalent region from the albopictus genome. Phylogenetic trees that included both aegyptii and albopictus EVEs showed that they are distinct viruses, separated by known exogenous isolates. Thus, the EVEs in these two mosquito species appear to be derived from at least two distinct flavivirus lineages, with the aegyptii virus being most divergent from previously characterized isolates. The albopictus sequence grouped in a clade that included both Kamiti river virus and cell fusing agent, as previously described by Crochu et al. [6].

EVEs related to viruses with DNA genomes
We identified highly significant matches to three families of viruses with DNA genomes in the genomes of mammals and birds ( Table 1, Tables S6 and S7). These included matches to two single stranded DNA (ssDNA) virus families (Parvoviridae and Circoviridaethe first ssDNA virus EVEs to be described in mammals -and one family of reverse transcribing DNA (rtDNA) viruses (Hepadnaviridae) -the first rtDNA EVEs to be described. A single match to a double stranded DNA (dsDNA) virus family (Adenoviridae) was identified in the kangaroo rat genome, but this sequence was unambiguously viral across its entire length (,17 Kb), encoding thirteen completely intact viral ORFs ( Figure S2), and is thus likely to have derived from free virus and not an EVE.
A subset of parvovirus-related EVEs represented complete or nearly complete viral genomes (Figure 5a). For one insertion in the M. lucifugus genome, we identified putative 59 and 39 terminal noncoding regions encoding characteristic inverted terminal repeats ( Figure S3). In general, however, DNA virus EVEs occurred as genomic fragments, with no particular region of the viral genome being obviously favored, with the exception of the circoviruses, for which only the Rep gene was found.
EVEs related to rtDNA viruses. We identified sequences disclosing significant similarity to Hepadnaviruses in the genomes of the black-legged tick (I. scapularis), and the zebrafinch (Taeniopygia guttata). In the zebrafinch genome, a total of 7 loci were identified on 7 distinct chromosomes (Figure 4c). Sequences at each locus generally corresponded to distinct, non-overlapping regions of viral genome, suggesting that host genome arrangements had fragmented a more complete insertion, spanning most if not all of the viral genome. A consensus constructed using all 7 zebrafinch hepadnavirus EVE insertions, and representing ,80% of the viral genome, grouped with avian hepadnaviruses in maximum likelihood phylogenies (Figure 4d). Although only distantly related to vertebrate hepadnaviruses (and hence not included in phylogenies), matches in the tick genome indicate the existence of an uncharacterized lineage of insect hepadnaviruses.
EVEs related to ssDNA viruses. EVEs derived from viruses of the family Parvoviridae were identified in a broad range of mammalian genomes (Figure 5a). In total, 58 EVEs in 17 species matched closely to the Dependovirus genus, 41 EVEs in 5 species matched the Parvovirus genus, and a single element in the cape hyrax (Procavia capensis) genome matched the Amdovirus genus ( Table 1, Table S6). Phylogenies confirmed the majority of these designations, grouping EVEs robustly within the diversity of genera to which they were assigned. However, a group of EVEs identified in the Tammar wallaby, opossum, and guinea pig genomes formed a distinct and well-supported clade, potentially representing a novel genus, intermediate between the Parvovirus and Amdovirus genera (Figure 5e).
The majority of parvovirus EVEs were not intact, and are unlikely to express RNA or protein. However, a dependovirus EVE in the genome of the African elephant (Loxodonta africana) encoded an intact NS1 gene (Figure 5a). Additionally, screening of EST databases identified expressed sequences related to an Adeno-associated virus 2 (AAV-2) integrates at a specific site in human chromosome 19 [40,41]. Notably, a nearly complete dependovirus EVE identified in the baboon genome, and that grouped closely with AAV-2 in phylogenies (Figure 5e), was inserted at an unambiguously distinct site homologous to human chromosome 21. We identified a dependovirus insertion that was orthologous between rats and mice (Table S6). This insertion comprised fragments of the NS1 gene, with internal breakpoints being bounded by homologous genomic DNA sequences -thus the possibility of two separate, but site-specific integration events could be excluded.
EVEs disclosing significant similarity to the Rep gene of ssDNA family Circoviridae were identified in the genomes of the gray shorttailed opossum, and three species of the mammalian order Carnivora; domestic cat (Felis catus), dog (Canis familiaris) and panda (Ailuropoda melanoleuca) (Figure 5c). These sequences grouped with other, recently characterized mammalian circoviruses in phylogenies (Figure 5f). One circovirus insertion was found to be orthologous in all three carnivore species. Host genomic DNA was identified at the 39 end of the Rep gene, indicating that the downstream Cap gene was absent from these EVEs.

Paleovirology of EVE insertions
We identified a number of EVE insertions that were orthologous between species, allowing minimum ages for families to be inferred from host divergence dates (see Figure 2). Using previously estimated mammalian divergence dates [42] we obtained minimum ages for the Parvo, Circo and Bornaviridae of 30, 68 and 93 million years respectively, demonstrating the ancient origins of these families ( Figure 6). During completion of this manuscript, orthologous filovirus EVEs were reported in the mouse and rat genomes [22]. These sequences were identified by BLAST searching using EVEs as probes, and were not picked up in our screen, which relied on matches to exogenous viruses. On the basis of the mammalian divergence dates used here [42], these EVEs provide a minimum age of 30 million years for the Filoviridae (Figure 6).
The EVEs identified in this study extend the host range of several families (Parvo-, Circo-, Hepadna, Bornaand Filoviridae) with respect to their known range as exogenous viruses ( Figure 6). Dependovirus EVEs are particularly widespread and occur in diverse mammalian hosts, despite their apparent low probability of germ line integration in AAV-derived gene therapy vector in vivo models [43]. Filoviruses have only been identified as exogenous infections in bats and primates [44]. However, filoviruses EVEs were identified not only in North American bats (M. lucifugus) and Asian primates (tarsier), but also in insectivores, rodents, and in both South American and Australian mammals ( Figure 6). In concordance with the recent identification of Ebola Reston in swine [45], this unexpected result indicates that the distribution of filoviruses is likely much broader than has previously been recognized.  Highly discordant host ranges among closely related EVEs (or EVEs and exogenous viruses) can provide information about transmission events. In this regard, we note that a dependovirus EVE in the bottlenose dolphin (Tursiops truncatus) genome grouped robustly with avian dependoviruses (rather than mammalian isolates) in NS1 trees (Figure 6d), suggesting cross-class transmission of parvoviruses between birds and mammals may have occurred in the past.

Evidence for exaptation of EVEs
EVEs that are neutral or only slightly deleterious in their hosts may fortuitously drift or hitchhike [46] to fixation, accumulating mutations at the host neutral rate. Alternatively, EVE insertions may confer an advantageous phenotype on the host and spread through the population by selection. In such exapted sequences, selection will act to maintain the functionality of the EVE sequence. Many of the EVEs identified in this study were highly mutated and/or fragmented and these likely represent nonfunctional, neutrally evolving pseudogenes. However, several EVEs encoded intact ORFs, and some also express RNA (Figure 2a, Figure 3a, Figure 5a). For most of these EVEs, the time since insertion is unknown, and intact ORFs could reflect recent insertion rather than a long-standing history of purifying selection within the host genome. In primates, however, orthology of the bornavirus-derived insert EBLN-1, which is intact in several species, demonstrates an insertion date predating the divergence of strepsirhine primates (,54 million years ago (MYA)) ( Figure 7). Simulations in which a consensus derived from all EBLN-1 sequences was allowed to neutrally evolve over this time period indicated the probability of maintaining an intact ORF in the absence of purifying selection was ,0.00001 (100,000 replicates, mean number of stop codons = 15.57, 95% confidence range 7.9-23.3). This analysis provides more robust support for purifying selection than classical tests based on the ratio of synonymous to non-synonymous mutations (which are weakly significant for EBLN-1 [11]), strongly indicating that EBLN-1 has been exapted in the primate genome, at least during part of its evolutionary history. Curiously, however, EBLN-1 has not retained coding capacity in all primate species. Perhaps selection to maintain it has recently been lost across all primates, and all the inserts may become inactivated in future.

Discussion
In this report, systematic screening revealed that sequences derived from a broad range of non-retroviral mammalian virus groups occur as endogenous elements in the genomes of mammals, birds and insect vectors. We describe the first EVEs derived from the rtDNA and dsRNA groups, thereby establishing that the complete range of known animal virus replication strategies (see Figure 1) are represented by endogenous elements in animal genomes.
Richer sampling of animal genomes is likely to reveal an even greater diversity of EVEs. While EVEs that are very ancient (i.e. that inserted prior to the divergence of major host lineages) can be identified by selectively screening a small number of host species, identification of more recent insertions will often require richer sampling within orders and genera. Sampling of mammalian species for whole genome sequencing has generally been across, rather than within orders (primates are an exception). Consequently the majority of mammal species sampled in this study diverged more than 50 million years ago ( Figure 6). Any mammal species that was not sampled, and diverged more recently, could contain uncharacterized EVEs. Sampling of avian and insect vector genomes has so far been quite limited, and these may also harbor a rich virus fossil history. Furthermore, the vast majority of EVE insertions never reach fixation, and there are likely many unfixed EVEs present within species gene pools at a given time (known examples of unfixed EVEs include Israeli acute paralysis virus (IAPPV) insertions in honey bees (Apis mellifera) [15], koala endogenous retrovirus (KoRV) in koala bears [47], and human herpesvirus 6 (HHV-6) and HERV-K HML-2 insertions in humans [18,48]). Identification of such unfixed EVEs will often require population-level screening.
The in silico screening strategy employed here likely underestimates the actual diversity of EVEs for several reasons. Firstly, only low-coverage, incomplete genome data were available for most species. Furthermore, EVEs within the data we screened could have been overlooked because (i) screening was based on similarity searches, and is thus dependent on current (limited) knowledge of viral diversity, and (ii) more ancient EVEs may not be identified due to the divergence in both host and virus lineages subsequent to insertion (this may also result in a bias toward detecting more conserved genes).
Certain groups of (non-retroviral) viruses appear to be better represented in the genomic fossil record than others (e.g. Parvoviridae, Mononegavirales). This likely reflects a predisposition for germline integration among viruses with particular patterns of replication and infection. Notably, viruses that establish persistent infections and/or replicate within the nucleus are particularly well represented among the EVEs identified in this study. Nevertheless, these characteristics do not appear to be prerequisites for germ line integration (Table 1). Indeed, since retroelements are ubiquitous in animal genomes, and replication of all known viruses requires the expression of RNA, retroelement activity in germ line cells [49] may present a general mechanism for mediating insertion of virus genes into animal germ lines (see Figure 1).
The discovery that a broad range of viruses are represented by EVEs in animal genomes indicates that viral 'molecular fossils' can provide the basis for robust, time-scaled, macroevolutionary studies across a range of animal and virus groups. For example, EVE sequences can be combined with phylogenetic data of extant host species to reveal patterns of inter-class virus transmission ( Figure 5) [50]. In this study, orthologous EVEs derived from the Borna-, Filo-Circo-, and Parvoviridae provided direct evidence for the ancient origins of these families ( Figure 6). These findings also indicate that more recent dates of origin obtained for other virus genomic contigs; 2 Structures represent a composite of semi-overlapping fragments; 3 Element has undergone genomic rearrangements, with the arrow indicating the direction of the rearranged fragment. (b) Phylogenetic relationships of dependovirus EVEs and representative exogenous viruses, based on NS1 gene and rooted on snake parvovirus. A dolphin EVE (indicated by an asterisk) groups robustly with avian rather than mammalian isolates. (d) Phylogenetic relationships of parvovirus EVEs and representative exogenous viruses, based on NS1 gene and rooted on Aleutian mink disease virus. Support for both ML phylogenetic trees was evaluated using 1,000 nonparametric bootstrap replicates. EVEs potentially comprising a new genus are indicated. (d) Phylogenetic relationships of circovirus EVEs and representative exogenous viruses, based on the Rep gene and rooted on avian circoviruses, with support for the ML phylogenetic tree evaluated using 1,000 nonparametric bootstrap replicates. Taxa that are shown as genetic structures in (a) are indicated by colored squares (EVEs) and circles (exogenous viruses) (a). Abbreviations: AAV = adeno-associated virus; MMV = minute virus of mice, AMDV = Aleutian mink disease virus, PV = parvovirus, CV = circovirus, PCV = porcine circovirus; HSAV = Human stoolassociated circular virus. doi:10.1371/journal.pgen.1001191.g005 Figure 6. Timescaled phylogenetic tree of mammals screened in this study (after Bininda-Emonds et al [42]) showing the known distribution of EVEs and of exogenous Borna-, Filo-, Circo-, and Parvoviruses. Grey circles indicate nodes at which orthologous EVE insertions were identified. For all orthologous insertions identified here and elsewhere [11,22], the virus family and genomic region represented by the ortholog is shown. Abbreviation: TLS = Thirteen-lined ground squirrel. doi:10.1371/journal.pgen.1001191.g006 families using molecular clock-based extrapolations are artifacts [30].
The diversity represented by known virus isolates represents a tiny fraction of the total viral diversity. Indeed, given their likely ancient origins, many virus families may be broadly distributed across mammalian hosts. This was reflected in viral phylogenies containing a mixture of EVEs and exogenous viruses -closely related exogenous relatives could often not be identified, or had only been recently characterized [37,38,51] (Figure 2, Figure 3, Figure 5). These findings suggest that EVEs can inform viral surveillance efforts by revealing novel virus diversity and indicating the likely host range of virus groups (particularly if they inserted relatively recently). For example, a strong association between filoviruses and marsupials (Table 1, Figure 2) unexpectedly highlighted this group as a potential filovirus reservoir. The potential presence of EVEs may also be an important consideration in studies where bulk sequencing of environmental samples is used to identify novel virus groups [51][52][53].
EVEs that reach fixation in the host germ line may do so fortuitously, or because they are exapted by the host genome. Monte Carlo simulations provided robust statistical support for a history of purifying selection in the primate EVE EBLN-1, indicating this sequence has been exapted by the primate genome. However, selection on EBLN-1 has clearly relaxed in some primates and may also have relaxed in humans (Figure 7). Such transient co-option may be expected for EVEs that function as restriction factors in their hosts by conferring resistance to infection by exogenous viruses. Several examples of this phenomenon have been described in animals [15,54,55], and it is likely one of the most common exaptations of viral genes by host organisms [56,57]. In these cases, counter-adaptation in a rapidly evolving virus population may eventually render the EVE restriction mechanism non-functional [55], causing selection to relax. Importantly, the rate at which EVEs are exapted as restriction factors in animals could greatly exceed their rate of fixation in animal genomes.
The diverse EVE sequences described in this report demonstrate an extensive history of gene flow from virus to animal genomes. Animal genomes are a living document of virus and host interaction, and genomic studies have an important role to play in advancing understanding of virus and host evolution.

Genome screening
Chromosome assemblies and whole genome shotgun assemblies of 44 species (Table S1) were screened in silico using tBLASTn and a library of representative peptide sequences derived from mammalian virus groups with genomes ,100 Kb in total length (selected from the 2009 International Committee on Taxonomy of Viruses (ICTV) master species list (Table S2)). Host genome sequences spanning high-identity (i.e. e-values ,0.0001) matches to viral peptides were extracted, and a putative viral ORF was inferred using BlastAlign [58] and manual editing. Putative EVE peptides were then used to screen the Genbank non-redundant (nr) database in a reciprocal tBLASTn search. Matches to retroviruses, viral cloning vectors, and non-specific matches to host loci were filtered and discarded. The remaining sequences were considered viral if they unambiguously matched viral proteins in the Genbank and PFAM databases as shown in Tables  S3, S4, S5, S6, S7. Genetic structures for these elements were determined by comparison of the putative EVE peptide sequence to the nucleotide sequence of a viral type species representing the most closely related viral genus recognized by ICTV. Boundaries between viral and genomic regions were identified by analysis of sequences flanking matches to viral peptides, the genomes of the host species, and closely related host species. Sequences that flanked viral insertions were considered genomic if they; (i) were present as empty insertion sites in a related host species; (ii) disclosed highly significant similarity (i.e. e-values ,1610 29 ) to host proteins; or (iii) non-viral and highly repetitive (.50 copies per host genome). Insertions were considered endogenous when .100 bp of genomic flanking sequence could be identified either side of a viral match. Insertions for which .100 bp of unambiguous (i.e. .80% nucleotide identity) flanking sequence was identified in host sister taxa were considered orthologous  Figure 6 is shown in greater detail here, with the number of stop codons in the EBLN-1 locus indicated for seven species. Orthology across these species indicates that EBLN-1 predates the divergence of these species 54 million years ago [42]. Monte Carlo simulations in which a consensus EBLN sequence was allowed to neutrally evolve at the primate neutral rate for this length of time showed that the average number of stop codons expected after this time is fifteen. (b) The distribution of the number of stop codons from 100,000 simulation replicates. Confidence intervals are indicated. doi:10.1371/journal.pgen.1001191.g007 insertions. PERL scripts were used to automate BLAST searches and sequence extraction. Putative EVE peptide sequences, and alignments of EVEs and exogenous retroviruses, are available online (http://saturn.adarc.org/paleo/).

Phylogenetic analysis
Putative EVE sequences inferred using BlastAlign were aligned with closely related viruses using MUSCLE and manually edited [59]. Maximum likelihood (ML) phylogenies were estimated using amino acid sequence alignments with RAXML [60], implementing in each case the best fitting substitution model as determined by ProtTest [61]. Support for the ML trees was evaluated with 1000 nonparametric bootstrap replicates. The best fitting models for the datasets were

Simulation
A Monte Carlo simulation procedure was employed to determine the probability that the bornavirus-derived element EBLN-1 has retained coding capacity over 54.1 million years under neutral evolution (i.e. not under purifying selection). A consensus EBLN-1 sequence was inferred, and the effects of neutral evolution were simulated using seq-gen [62] for a branch length equivalent to the minimum amount of time that EBLN-1 orthologs have resided in primate genomes, based on the primate divergences estimated by Bininda-Emonds et al [42], and given a neutral rate of evolution of 2.2610-9 [12]. The number of stop codons accrued was counted for 100,000 iterations of the simulation. The probability that the reading frame could have remained open under neutrality is given by the number of replicates under which no stop codons have evolved, divided by the number of iterations.