Molecular fossils illuminate the evolution of retroviruses following a macroevolutionary transition from land to water

The ancestor of cetaceans underwent a macroevolutionary transition from land to water early in the Eocene Period >50 million years ago. However, little is known about how diverse retroviruses evolved during this shift from terrestrial to aquatic environments. Did retroviruses transition into water accompanying their hosts? Did retroviruses infect cetaceans through cross-species transmission after cetaceans invaded the aquatic environments? Endogenous retroviruses (ERVs) provide important molecular fossils for tracing the evolution of retroviruses during this macroevolutionary transition. Here, we use a phylogenomic approach to study the origin and evolution of ERVs in cetaceans. We identify a total of 8,724 ERVs within the genomes of 25 cetaceans, and phylogenetic analyses suggest these ERVs cluster into 315 independent lineages, each of which represents one or more independent endogenization events. We find that cetacean ERVs originated through two possible routes. 298 ERV lineages may derive from retrovirus endogenization that occurred before or during the transition from land to water of cetaceans, and most of these cetacean ERVs were reaching evolutionary dead-ends. 17 ERV lineages are likely to arise from independent retrovirus endogenization events that occurred after the split of mysticetes and odontocetes, indicating that diverse retroviruses infected cetaceans through cross-species transmission from non-cetacean mammals after the transition to aquatic life of cetaceans. Both integration time and synteny analyses support the recent or ongoing activity of multiple retroviral lineages in cetaceans, some of which proliferated into hundreds of copies within the host genomes. Although ERVs only recorded a proportion of past retroviral infections, our findings illuminate the complex evolution of retroviruses during one of the most marked macroevolutionary transitions in vertebrate history.


Introduction
The ancestors of modern cetaceans (whales, dolphins, and porpoises) underwent a macroevolutionary transition from terrestrial to aquatic environments early in the Eocene >50 million years ago [1][2][3][4]. Pakicetids, the earliest known cetaceans that existed in the early Eocene, are like to be aquatic waders [3,4]. During the transition from land to water, cetaceans evolved a range of morphological and behavioral innovations, including streamlined bodies, filter-feeding, echolocation, as well as loss of hindlimbs, body hair, and dermal glands [1,2,5]. Phylogenetic analyses reveal that cetaceans are closely related to and fall within the diversity of eventoed ungulates (Artiodactyla) [6]. Therefore, Cetacea and Artiodactyla have been sometimes united into Cetartiodactyla [6]. Within Cetartiodactyla, Hippopotamidae has been placed to be the sister group of Cetacea [6]. Modern cetacean species can be further divided into two clades, namely mysticetes (baleen whales) and odontocetes (toothed whales) [3,7]. Mysticetes and odontocetes have been estimated to diverge from each other in the Late Eocene (~36 million year ago) [6].
Retroviruses have been known to infect vertebrates, including cetaceans [22][23][24][25][26][27]. The replication of retroviruses requires reverse transcription and integration of viral genomes into host genomes. On occasion, retroviruses infect germ line cells, and the integrated retroviruses may become vertically inherited, forming the so-called endogenous retroviruses (ERVs) [25][26][27]. ERVs recorded past retroviral infections, providing molecular fossils for studying the macroevolution of retroviruses. Therefore, ERVs represent a unique resource to explore the evolution of retroviruses during the macroevolutionary transition from land to water of cetaceans.
ERVs proliferate within the host genomes through three modes: ERVs in germ line cells or somatic cells produce virus particles to infect germ line cells, namely reinfection [31,32]; ERVs can also increase in copy number within the cell either by retrotransposition in cis (viruses use their own proteins for mobilization) or by complementation in trans (viruses use proteins produced by other transposable elements within the same cell) [33][34][35]. Reinfection can also occur by complementation in trans, that is, retroviruses without functional env genes can produce virus particles to infect germline cells by "hitchhiking" env gene of other retroviruses [36]. Reinfection requires the three core genes (gag, pol, and env) to be functional, and thus all the three core genes are subject to purifying selection, as indicated by a nonsynonymous to synonymous substitution rate ratio (dN/dS) of < 1 [31,32]. ERV transposition in cis does not require a functional env gene, and ERV proliferation by complementation in trans does not require any functional gene of its own [31,32]. Different ERVs increase in copy number through different ways; for example, while human ERV family HERV-K (HML2) members proliferate mainly by reinfection [31], intracisternal A-type particles (IAPs) proliferate mainly by retrotransposition in cis [36].
In this study, we used a phylogenomic approach to trace the origin and evolution of ERVs along the course of cetacean evolution, and identified a total of 8,724 ERVs in 25 cetacean genomes, which cluster into 315 distinct ERV lineages. We hypothesize that cetacean ERVs originated through two possible routes, through either land-to-water transition or secondary host switching. Our study provides novel insights into the evolution of retroviruses during one of the most remarkable macroevolutionary transitions in vertebrate history.

Identification and classification of ERVs in cetaceans
To explore the evolution of retroviruses in cetaceans, we used a similarity search and phylogenetic analysis combined approach to systematically identify ERVs within the genomes of 25 cetaceans, including 6 mysticetes and 19 odontocetes (S1 Table). We found the presence of ERVs in all the cetaceans, and identified a total of 8,724 ERVs, which is consistent with the ubiquitous distribution of ERVs in vertebrates [26,27]. The copy numbers of ERVs in cetaceans are relatively low, varying from 222 in Physeter catodon to 627 in Lagenorhynchus obliquidens. However, the estimates of ERV copy numbers should be taken with caution, because the quality and completeness of genome assemblies might affect the number of ERVs detected [37]. Moreover, our mining approach is based on reverse transcriptase (RT) proteins, and fragmented ERVs without RT proteins might not be identified.
Next, we performed a large-scale phylogenetic analysis of cetacean ERVs, representative vertebrate ERVs, and representative exogenous retroviruses to identify distinct ERV lineages in cetaceans. Based on the phylogenetic analyses and host species distribution of ERVs, the cetacean ERVs identified were classified into 315 distinct lineages (S1 and S2 Figs), including an ERV closely related to deltaretroviruses within the genome of Platanista minor as previously described [22]. To confirm the classification of cetacean ERV lineages and investigate the origin of cetacean retroviruses, we further screened for ERVs that are closely related to each ERV lineage within the vertebrate genomes. We found that each cetacean retroviral lineage identified in this study forms a monophyletic group and nests within the diversity of retroviruses from non-cetacean mammals, suggesting that each cetacean ERV lineage represents one or more independent invasion events (S2 Fig). Phylogenetic analysis shows that lineages 1 to 282 belong to Class I ERVs, among which lineages 1 to 123 and 124 to 282 are closely related to gammaretroviruses and epsilonretroviruses, respectively (S1 Fig). Lineages 283 to 304 belong to Class III ERVs, and lineages 305 to 315 belong to Class II ERVs (S1 Fig). Taken together, these results suggest a wide variety of retroviruses infected cetaceans and/or their ancestors.

Scenarios of retrovirus evolution in cetaceans
We hypothesize that cetacean retroviruses originated through two possible evolutionary scenarios, the land-to-water transition (LTW) scenario and the secondary host switching (SHS) scenario (Fig 1). In the LTW scenario (Fig 1A), a retrovirus infected the ancestor of cetaceans, integrated into its genome before or during (discussed below) the conquest of aquatic environment, and transited into water with their ancient cetacean hosts. Then, the ERV remnants (including solo-long terminal repeat [solo-LTR], if the ERV internal region was deleted due to Under the SHS scenario, the ERV should only be identified in a sub-lineage of cetaceans. The ERV is not expected to be closely related to ERVs from any certain vertebrate species. The phylogenetic relationships of cetaceans are based on TimeTree [64] and literatures [65,66]. Illustrations of mysticetes, odontocetes, and Pakicetus courtesy by Chris Huh, Chris Huh, and Conty, respectively. recombination between two LTRs) should be identified in the genomes of nearly all the modern cetaceans. Most of these cetacean ERVs are expected to be closely related to ERVs from Hippopotamus amphibious (Fig 1C). It should be noted that some ERVs might be lost during the evolutionary course of hippopotamuses or cetaceans, and some retroviruses might infect the ancestor of modern cetaceans during cetaceans invaded the aquatic environments but after cetaceans and hippopotamuses diverged. In the SHS scenario (Fig 1B), retroviruses infected cetaceans through cross-species transmission after the conquest of aquatic environments by cetaceans, and became integrated into the host genome. Then, the ERV can be only identified in a sub-lineage of cetaceans and might proliferate to a high copy number (Fig 1C). The ERV is not expected to be closely related to ERVs from a certain vertebrate species. Moreover, the species whose ERVs are identified to be closely related to the cetacean ERVs might not represent the "actual" source of the cetacean retroviruses.

Origins of retroviruses through land-to-water transition
Consistent with the LTW scenario, 298 (94.60%) out of 315 ERV lineages were found to be distributed in both mysticetes and odontocetes (Fig 2), implying that these ERV lineages were present in the last common ancestor of modern cetaceans. Class I ERVs account for a major proportion (276/298, 92.62%) of the LTW ERV lineages, among which lineages 1 to 117 and lineages 124 to 282 are closely related to gammaretroviruses and epsilonretroviruses, respectively (Fig 2). The remaining LTW ERV lineages (lineages 283 to 304) belong to Class III ERVs. The copy numbers of these LTW ERV lineages within a cetacean genome are generally very low (usually one copy in one genome) (Fig 2), suggesting that most of the LTW ERVs were not active after transiting to water along with their hosts. Some of these ERV lineages might be absent in certain species, due to internal region removal through recombination between the two LTRs of an ERV, degradation due to the absence of functional constraints, or occasionally sequencing error. To further elucidate the origin and evolutionary history of distinct ERV lineages in cetaceans, we performed phylogenetic analyses of cetacean ERVs and ERVs closely related within the vertebrate genomes for each ERV lineage. Interestingly, for 208 (69.80%) of these LTW ERV lineages, cetacean ERVs are closely related to ERVs from H. amphibius, indicating that these retrovirus endogenization events occurred before the last common ancestor of cetaceans and hippopotamuses. For the remaining 90 (30.20%) of these LTW ERV lineages, cetacean ERVs cluster with ERVs from diverse even-toed ungulates other than H. amphibious (Fig 3). This pattern can be explained by ERV removal in the lineage leading to H. amphibious, or cross-species transmission from even-toed ungulates other than H. amphibious to cetaceans after cetaceans and hippopotamuses diverged.
Moreover, we also performed synteny analyses for these LTW ERV lineages. For 28 LTW ERV lineages, we found orthologous ERV insertions between cetaceans and H. amphibious (Fig 4A and 4B and S6 Table). For 26 LTW lineages, we found orthologous ERV insertions between odontocetes and mysticetes ( Fig 4C and S7 Table). For the remaining 244 LTW ERV lineages, no complete ERV was identified, which makes it difficult to distinguish the host-ERV boundary to establish orthologous relationships. For these 244 LTW ERV lineages without full length ERVs, we used an event-based method to quantitatively compare phylogenetic congruence between ERVs and their cetacean hosts. For all the 244 lineages, we found ERV phylogenies are statistically congruent with the cetacean phylogeny (P < 0.01) (Fig 4D and S8 Table). These results further confirmed that these 298 ERV linages arose through retrovirus endogenization events that occurred before the last common ancestor of modern cetaceans (before or during the evolutionary transition from land to water of cetaceans), and these ERVs transitioned to aquatic environments within their host genomes.

Origins of retroviruses through secondary host switching
Consistent with the SHS scenario, we found 17 (5.40%) out of 315 cetacean ERV lineages are distributed in the genomes of species within a sub-lineage of cetaceans. Lineages 306, 309, 310 were only identified within the genomes of mysticetes, and lineages 118 to 123, 305, 307-308, 311 to 315 were only identified within the genomes of odontocetes (Fig 2). Lineages 118 to 123 belong to Class I ERVs and are closely related to gammaretroviruses, and lineages 305 to 315 belong to Class II ERVs (Figs 2 and S1) [22]. The copy numbers of these SHS ERV lineages are generally higher than those of the LTW ERV lineages; for example, lineage 121 ERVs reach 251 copies in the genome of Kogia breviceps. The SHS ERV lineages are closely related to ERVs from various mammals, including Chiroptera, Galeopterus variegatus, Manis javanica, Catagonus wagneri, Sigmodon hispidus, Scandentia, Monodelphis domestica, Procavia capensis, and Bovidae (Fig 3) [22]. However, five SHS ERV lineages are closely related to ERVs of mammals but their closest relatives could not be accurately identified. Once again, it should be noted that the species in which the ERVs closely related to a SHS ERV lineage were identified might

PLOS PATHOGENS
Evolution of cetacean retroviruses not represent the "actual" source. Moreover, the possibility that one SHS ERV lineage arose through multiple endogenization of closely related retroviruses cannot formally excluded. Nevertheless, our results indicate that these 17 ERV lineages may derive from the endogenization of retroviruses which infected cetaceans through cross-species transmission from noncetacean mammals after the land-to-water transition of cetaceans.

Temporal dynamics of cetacean ERV amplification
The long terminal repeats (LTRs) on both sides of a provirus are identical at the beginning of virus integration, followed by divergence due to neutral evolution in the host genome. Therefore, the timing of a single ERV integration event can be estimated by measuring the genetic distance between LTR sequences. The genetic distance between 5'-and 3'-LTRs of an ERV increases with its integration time [37,38]. To explore the temporal dynamics of the LTW ERV lineages, we retrieved all the complete LTW ERVs and calculated the genetic distance between their 5'-and 3'-LTRs (Fig 5A). The traditional estimation of ERV ages requires host neutral evolutionary rates to translate genetic distance into absolute time (in years). However, host neutral evolutionary rates based on known mammal rates might be not accurate for cetaceans. Instead, in this study, we directly compared the genetic distance of LTRs with that of cetacean neutrally evolving regions (introns used in this study) [39,40]. We first estimated the genetic distance of orthologous introns between Balaenoptera acutorostrata and H. amphibious (reflecting the divergence between cetaceans and hippopotamuses) as well as the genetic distance of orthologous introns between B. acutorostrata and Orcinus orca (reflecting the divergence between mysticetes and odontocetes). The peak of genetic distance between 5'-and 3'-LTRs of the LTW ERVs overlaps the mean genetic distance of introns between cetacean and hippopotamus and is much greater than the mean genetic distance of introns between mysticetes and odontocetes (Fig 5A). These analyses have two caveats: (I) Gene conversion might occur between 5'-and 3'-LTRs, which decreases their genetic distance [41,42]. Therefore, all the LTR sequences involving recombination or gene conversion were excluded in this study. (II) ERVs might not evolve at a similar rate as introns. Nevertheless, these results further support that most of the LTW ERVs invaded host genomes before the last common ancestor of the modern cetaceans.
We also investigated the temporal dynamics of the SHS ERV lineages by retrieving a total of 485 complete ERVs and calculating the genetic distance between 5'-and 3'-LTRs for each ERV (Fig 5A). Unlike the LTW ERVs, the genetic distance between 5'-and 3'-LTRs of these SHS ERVs peaks at 0, suggesting that a majority of the SHS ERVs proliferated in relative recent time. Then, we mapped 5'-and 3'-LTR distance distribution for 14 SHS ERV lineages (lineages 123, 305 and 315 were excluded due to their limited number of complete ERVs) (Fig 5B). We found that the genetic distance between 5'-and 3'-LTRs for all these 14 SHS ERV lineages peaked after the divergence of mysticetes and odontocetes, and seven SHS ERV lineages (lineages 118, 119, 121, 307 and 314) peak at 0. Ten SHS ERV lineages (lineages 118, 119, 121, 122, 306, 307, 310, 311, 313, and 314) contain ERVs with identical LTRs, suggesting that these ERV lineages might still actively proliferate. The genetic distances between 5'-and 3'-LTRs for all the SHS ERV lineages are less than the genetic distance of introns between cetaceans and hippopotamuses, and are less or around the genetic distance of introns between mysticetes and odontocetes, with lineage 122 as an exception (Fig 5B). For lineage 122, we found closely related ERV sequences in Elephantulus edwardii but not in the mysticetes (S2 Fig). Thus it is possible that the large 5'-and 3'-LTR distance might be due to local elevated evolutionary rates, but other possibilities cannot be formally excluded. Moreover, we identified orthologous ERV insertions in some but not all the cetaceans for several SHS ERV lineages, further

PLOS PATHOGENS
Evolution of cetacean retroviruses supporting these ERVs were still active after the divergence between mysticetes and odontocetes (see three examples in Fig 4A). Taken together, all these lines of evidence suggest that the SHS ERV lineages might originate independently from recent cross-species transmissions and have been actively transposing in cetaceans in relatively recent time after the divergence between mysticetes and odontocetes.

Modes of ERV proliferation in cetaceans
For the LTW ERV lineages, the ERV copy numbers within a single cetacean genome are generally low, further supporting that most of these lineages have not been active. However, for the SHS ERV lineages, the ERV copy numbers within a single genome are generally high, sometimes reaching hundreds of copies. ERVs have been thought to proliferate in the host genomes through either reinfection or retrotransposition. Under different proliferation modes, the three core genes (gag, pol, and env) are subject to different selection pressure. To explore the proliferation modes for the SHS ERV lineages, we performed selection pressure analyses of retroviral genes for eight lineages with greater than four ERVs in a certain species by estimating dN/dS ratios for internal branches [43]. For seven lineages (lineages 118, 119, 121, 122, 306, 310, and 314), we found that all the three retroviral genes are subject to purifying selection (dN/dS <1), indicating that these SHS ERVs might proliferate mainly through reinfection. Interestingly, the ERVs of lineage 307 lose env gene, and the gag and pol genes of lineage 307 ERVs underwent purifying selection (Fig 6). The proliferation of this lineage may be mainly through retrotransposition in cis or complementation by hitchhiking of the functional env gene of a co-infecting retrovirus [31,36].

Discussion
In this study, we investigated the evolutionary histories of ERVs within the cetaceans, a group of mammals that underwent a macroevolutionary transition from terrestrial to aquatic environments >50 million years ago. We identified a total of 315 distinct ERV lineages that belong to Class I, II, and III, suggesting that diverse retroviruses infected cetaceans and their ancestors. We found two major routes through which retroviruses evolved during the macroevolutionary transition from land to water by cetaceans, namely the land-to-water transition scenario and the secondary host switching scenario. A majority (about 95%) of ERV lineages as genomic loci (not exogenous retroviruses) appear to have undergone a shift from land to water with their cetacean hosts. The LTW scenario actually includes retroviruses that infected and became endogenized in the terrestrial ancestor of cetaceans (before the conquest of aquatic environment) and retroviruses that infected the semiaquatic or aquatic ancestor of cetaceans before the last common ancestor of mysticetes and odontocetes (during the conquest of aquatic environment), which cannot be clearly distinguished based on the current data. The ERV copy numbers for these ERV lineages are generally low within a single host genome, suggesting the activity of ERVs was not high upon transiting into water. These retrovirus lineages seem to await degradation after evolutionary journey to aquatic environment, namely evolutionary dead-ends.
Interestingly, we identified 17 ERVs lineages that are only present in a sub-lineage of cetaceans, either within mysticetes or within odontocetes. Synteny analysis and integration time analyses show that these ERVs derived from recent retroviral integrations. Phylogenetic analyses indicate these retroviral lineages might originate from cross-species transmissions after the colonization of aquatic environments of cetaceans. These ERVs might represent a proportion of retroviruses currently circulating in cetaceans, given not all the retroviruses in cetacean have been endogenized. Many of these cetacean ERVs are closely related to mammal species other than even-toed ungulates. Actually, cetaceans have more interaction with terrestrial and semi-aquatic mammals than intuitively thought; for example, killer wales have been seen feeding on terrestrial mammals and seals [44][45][46]. Indeed, our previous studies found retroviruses of aquatic and terrestrial origins are frequently interconnected with each other [27]. Therefore, the land-water interfaces might not present a strict barrier for retrovirus transmission [27]. However, no evidence that these ERVs originated from cross-species transmissions from fishes was found, although cetaceans have been feeding on fishes for long. The identical LTRs of many SHS ERVs suggest they integrated into the host genomes very recently, and the nonzero peaks of LTR distance distribution for some SHS ERV lineages indicate they might be still proliferating. The pathogenicity of these potentially active retroviruses remains to be explored. Understanding the diversity of retroviruses might have implications in the conservation biology of cetaceans.
The LTW ERV lineages belong to Class I and III ERVs, but the SHS ERV lineages belong to Class I and II ERVs, reflecting a changed retrovirus spectrum after diving into the aquatic environments. Given that the 17 SHS ERV lineages originated independently, our results suggest at least 17 cross-species transmission events from non-cetacean mammals to cetaceans occurred after cetaceans invaded aquatic environments but during the evolutionary course of the modern cetaceans. However, these ERVs only represent a proportion of retroviruses currently circulating in cetaceans, because not all the retroviruses in cetaceans have been endogenized. It is possible that host-switching from non-cetacean mammals to cetaceans might be more frequently than appreciated.
Dozens of viruses, both DNA viruses and RNA viruses, have been described in cetaceans. Due to the under-sampling of viruses in cetaceans and more generally wild mammals, it is difficult to explore how these DNA and RNA viruses originated in cetaceans, as well as how viruses evolved during the evolutionary transition from terrestrial to aquatic environments. In contrast, ERVs provide one of the best models to study these questions. Our study provides novel insights into the complex evolution of retroviruses, and possibly viruses in general, during the macroevolutionary transition of cetaceans.

ERV mining
All the cetacean genomes were retrieved from NCBI genome resources (https://www.ncbi. nlm.nih.gov/genome/), including 6 Mysticeti species and 19 Odontoceti species (S1 Table). We used a similarity search and phylogenetic analysis combined approach to identify ERVs in the cetacean genomes [27]. Briefly, we first used the tBLASTn algorithm to search the cetacean genomes with representative RT proteins as queries and an e cut-off value of 10 −5 and a length cut-off value of 150 amino acids. Due to homology shared between retroviruses and retrotransposons, we performed phylogenetic analysis of the significant hits and RT proteins of representative retroviruses and retrotransposons [27]. Sequences forming a monophyletic group with representative retroviruses are authentic ERVs. We also used the forementioned method to mine ERVs in representative vertebrates (S2 Table). We performed large-scale phylogenetic analyses of RT proteins from cetacean ERVs, representative vertebrate ERVs, and representative exogenous retroviruses (S4 Table). Protein sequences were aligned using MAFFT 7.450 [47]. Initial large-scale phylogenetic analyses were performed using an approximate maximum likelihood method implemented in FastTree 2.1.10 [48]. A monophyletic group of cetacean ERVs was treated as a distinct lineage for the subsequent analyses.

Classification of cetacean ERVs
To explore the relationship between cetacean ERVs and retroviruses, one representative sequence was selected for each cetacean ERV lineage. The RT sequences of cetacean ERVs and representative endogenous and exogenous retroviruses were aligned using MAFFT 7.450 [47] (S3 and S4 Tables). Phylogenetic analysis was performed using a maximum likelihood method implemented in IQ-tree 2 [49] (S1 Fig). ModelFinder implemented in IQ-tree 2 was used to determine the best-fitting model [50]. The node supports were evaluated using the ultrafast bootstrap method with 1,000 replicates [51,52].

Identifying putative sources of cetacean ERVs
To confirm the distribution of each cetacean ERV lineage in cetaceans and to identify the putative source of each cetacean ERV lineage, we used the BLASTn algorithm to search against the cetacean genomes and all the currently available vertebrate genomes with representative cetacean ERV RT sequences and an e cut-off value of 10 −5 . All the sequences were aligned using the L-INS-i strategy implemented in MAFFT 7.450 [47] and then manually refined. Phylogenetic analyses for each cetacean ERV lineage were performed using a maximum likelihood method implemented in IQ-tree 2 [49] (S2 Fig).

Dating the invasion time of cetacean ERVs
To identify complete ERVs, we bidirectionally extended all the cetacean ERV RT sequences and used LTRfinder [53] and LTRharvest [54] to identify the LTRs. The complete ERVs were annotated using Conserved Domain search [55] and the BLASTp algorithm [56]. For each cetacean ERV lineage, the 5'-and 3'-LTRs was aligned using MUSCLE [57], and phylogenetic analyses were performed using IQ-tree 2 [49]. The ERVs whose 5'-and 3'-LTRs cluster together were retrieved for further analyses [58]. The 3SEQ algorithm implemented in RDP4 [59,60] was used to detect recombination among LTR sequences. GeneConv [61] was used to detect gene conversion occurring in LTR sequences [58]. LTR sequences with signals of recombination or gene conversion were excluded from the dating analyses. The genetic distance between 5'-LTR and 3'-LTR was estimated with the Kimura two-parameter substitution model [62]. To get a cetacean evolutionary time scale, 116 random orthologous introns of B. acutorostrata and H. amphibian and 100 random orthologous introns of Balaenoptera acutorostrata and Orcinus orca were retrieved and aligned using MAFFT 7.450 [47].

Identification of orthologous ERV insertions
To identify the orthologous insertions of a complete ERV between cetaceans and H. amphibious or between mysticetes and odontocetes, we bidirectionally extended 500-1,000 bp sequences flanking the ERV (S5-S7 Tables). We used the BLASTn algorithm to search against the genomes of cetaceans and H. amphibious with the flanking sequences and the ERV as the queries. If the two flanking sequences are connected to each other, there is no ERV insertion.

Test of the congruence between ERV and cetacean phylogenies
To assess whether ERV phylogenies are congruent with the cetacean phylogeny, we used an event-based approach implemented in Jane 4 [63]. The cost scheme of cospeciation-duplication-duplication with host switching-loss-failure to diverge was set as 0-1-2-1-1 [27]. Sample size for random parasitic tree analysis was set to 50 (S8 Table). The cetacean phylogeny used in this study was based on TimeTree [64] and literatures [65,66].

Selection pressure analyses of cetacean ERV genes
Within a single cetacean species, all the complete ERV sequences for each SHS ERV lineage were retrieved and aligned with the L-INS-i strategy using MAFFT 7.450 [47]. Datasets with less than four sequences or with sequences with a common disruptive mutation were excluded. The ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/) was used to predict ORFs. The Conserved Domains (CD) Search [55] and the BLASTp algorithm [56] were used to determine the retroviral genes, namely gag, pol, and env. Premature stop codons were removed. The dN/ dS ratio was estimated using the CodeML program in PAML 4.9 [67]. The "one-ratio" model is used to calculate the overall dN/dS ratio, and the "two-ratio" model is used to estimate the dN/ dS ratios for internal and terminal branches. The likelihood ratio test was used to evaluate the significance of the difference between the "one-ratio" model and the "two-ratio" model.  Table. The information of the representative retroviruses is available in S4 Table. (PDF)