
Endogenous viral elements trace the ancient origins and early evolution of the Caulimoviridae

Abstract

Endogenous viral elements (EVEs) are viral sequences integrated into host genomes, functioning as molecular fossils of past infections. Most characterised EVEs in plants are derived from the Caulimoviridae, the only family of dsDNA viruses infecting this kingdom. Endogenous caulimovirids (ECVs) occur across taxonomically diverse vascular plant species and represent a significant resource for studying host-virus coevolution, host range dynamics, and the evolution of viral genomes over deep timescales. Previous evolutionary studies utilising ECVs have proposed cospeciation or host switching as the primary drivers of Caulimoviridae diversification; however, these studies were limited by poor representation of genomic data from basal plant lineages. Here, we analysed 93 plant genomes spanning all major embryophyte groups, including ferns and lycophytes, and identified 47,135 ECVs across 75 genomes. These sequences were classified into 71 operational taxonomic units (OTUs), including 35 previously undescribed groups, revealing substantial and previously unrecognised diversity within the Caulimoviridae. Notably, we identified a basal clade restricted to the Araucariaceae, an ancient lineage of Gondwanan conifers. Phylogenetic comparisons between ECVs and host plant lineages support a macroevolutionary model in which cospeciation with tracheophytes played a dominant role in shaping Caulimoviridae diversification. Together, these findings establish Caulimoviridae and their endogenous counterparts as a powerful model system for paleovirology, offering unprecedented insights into the coevolution, diversification, and extinction of plant viruses over deep evolutionary timescales.

Author summary

Viruses have infected plants for hundreds of millions of years, but tracing their long-term evolution is difficult because viruses never leave direct fossil evidence. However, fragments of viral DNA sometimes become permanently embedded in plant genomes. These sequences, called endogenous viral elements (EVEs), act as molecular “fossils” that preserve evidence of ancient infections. In this study, we explored EVEs related to the plant virus family Caulimoviridae, the only known reverse transcribing viruses that infect plants. By analyzing 93 plant genomes representing all major groups of land plants, including ferns, lycophytes, gymnosperms, and angiosperms, we identified more than 47,000 viral sequences integrated into plant DNA. These sequences revealed an unexpectedly high diversity of Caulimoviridae, including 35 previously unknown evolutionary lineages and a newly recognized viral clade restricted to certain conifers. By comparing viral and plant evolutionary histories, we found evidence that many of these viruses diversified alongside their hosts over hundreds of millions of years, although host switches and extinctions also occurred. Together, our results show that endogenous viral sequences provide a powerful window into the deep evolutionary history of plant viruses and their long-term interactions with plants.

Introduction

The evolution of land plants has been significantly shaped by selective pressures imposed by pathogens, including viruses. Although numerous plant virus families have been identified [1,2], their origins and patterns of macroevolution remain poorly understood. Metagenomic surveys have revealed a wealth of previously undetected viruses and firmly established viruses as integral components of ecosystems, emphasizing the vast yet largely undocumented plant virosphere [3]. However, viral metagenomics generally provides only a temporal snapshot of viral communities. In contrast, viral sequences integrated into host genomes, known as endogenous viral elements (EVEs), provide a unique, untargeted record of historical viral infections. Mining host genomes for EVEs has proven highly informative, as these molecular fossils allow the reconstruction of past viral diversity and host-virus interactions across evolutionary timescales, providing invaluable insights into the macroevolution of viruses [4].

In plants, most characterized EVEs are derived from the Caulimoviridae, the only family of dsDNA viruses in this kingdom [5]. Caulimovirids possess non-covalently closed, circular, double-stranded DNA genomes of 7.1–9.8 kbp, which encode a core suite of proteins including a 30K movement protein (MP), a capsid protein (CP), a minor virion-associated protein (VAP) and a polymerase polyprotein (Pol) containing aspartic protease (AP), reverse transcriptase (RT), and ribonuclease H1 (RH1) domains [6]. They are currently classified into eleven genera based on virion morphology, genome organization, and transmission mode. Species are distinguished by RT-RH1 sequence divergence, with an 80% nucleotide identity threshold. Viruses in some genera encode auxiliary proteins, such as the aphid transmission factor in the genera Caulimovirus and Soymovirus, which allow these viruses to occupy specific ecological niches [6].

The Caulimoviridae belongs to the order Ortervirales, along with Retroviridae, Belpaoviridae (Bel/Pao elements), Metaviridae (Ty3/Gypsy elements), and Pseudoviridae (Ty1/Copia elements) [7]. All share a replication cycle alternating between dsDNA and ssRNA via reverse transcription. Members of the Caulimoviridae differ from other Ortervirales by lacking an obligatory proviral stage in the replication cycle, whereby the viral genome integrates in the host genome through the action of an integrase enzyme. Nevertheless, their sequences are widely integrated into plant genomes, likely due to incorporation as filler DNA during the repair of double-strand DNA breaks in host chromosomes [5].

Since the first endogenous caulimovirids (ECVs) were reported in the late 1990s [8–10], advances in plant genome sequencing have enabled their widespread identification through bioinformatic approaches. While only a few ECVs are replication-competent and capable of reactivation [11–13], dozens of complete or partial viral genomes have been assembled from ECVs by in silico assembly of overlapping DNA fragments. These genomes show all the hallmarks of extant caulimovirids, including the presence of genes encoding the core suite of movement, structural, and replication associated proteins, and conserved transcriptional and translational regulatory elements. Although phylogenetic analyses place them within the family, many stand outside the 108 species and eleven genera currently recognized by the International Committee for Taxonomy of Viruses (ICTV) [6,14–19]. Most such viral genomes were assembled from ECVs by aligning multiple, highly similar copies that share greater than 90% nucleotide identity in the region encoding the RT-RH1 domain.

Previous studies have shown that ECVs extend across all major euphyllophyte lineages (ferns, gymnosperms, and angiosperms), far exceeding the host range of known episomal caulimovirids [5]. Mushegian and Elena (2015) [15] were the first to identify caulimovirid MP homologs in the genomes of gymnosperms and ferns. This host distribution was later confirmed and extended by Gong & Han (2018) [17] and Diop et al. (2018) [16]; the latter study also reported the presence of a caulimovirid RT transcript in the lycophyte Lycopodium annotinum. However, ECV studies were constrained by a poor representation of genomic and transcriptomic data from basal plants. The surge in high-throughput sequencing, particularly of genomes from previously underrepresented taxa such as lycophytes, now enables a more comprehensive exploration of ECVs across tracheophytes. Automated bioinformatic pipelines for ECV detection and annotation [20] further facilitate host-range assessment and refinement of evolutionary hypotheses.

Here, we investigated ECV diversity in 93 plant genome assemblies and transcriptomes spanning basal and representative tracheophytes. Thousands of novel ECVs were identified, extending the Caulimoviridae host range to all tracheophyte divisions. By integrating these with prior studies [14,16–19,21,22], we propose a macroevolutionary model in which long-term host–virus cospeciation was punctuated by host and virus extinction events. Our findings position Caulimoviridae and their endogenous counterparts as a central model for investigating the evolution of plant viruses.

Results

Extended diversity and host range of the Caulimoviridae

We screened 73 plant genomes (Fig 1 and S1 Table) for the presence of endogenous caulimovirid RT protein coding sequences (ECRTs) using the Caulifinder pipeline branch B described by Vassilieff et al. (2022) [20]. The dataset included seven bryophytes and 66 tracheophytes comprising nine lycophytes, 17 ferns, 27 gymnosperms, three basal angiosperms (ANA-grade), and ten mesangiosperms (one species of Chloranthaceae and nine Magnoliids). This screening yielded 20,267 amino acid ECRT sequences (aa-ECRTs) across 55 plant species (Fig 1). No aa-ECRTs were detected in any of the seven bryophyte genomes analysed, nor in six lycophytes and five ferns (Fig 1). In contrast, aa-ECRTs were detected in the genomes of three lycophytes, twelve ferns, and all gymnosperm, basal angiosperm, Chloranthaceae, and Magnoliid species examined (Fig 1). The number of aa-ECRTs per genome varied greatly, even within the same plant division, particularly in gymnosperms, where the number of aa-ECRTs ranged from three in Metasequoia glyptostroboides to 3,352 in Pinus tabuliformis. A positive correlation was observed between the number of aa-ECRTs and host genome size (p-value = 9e-12). However, genome size alone accounted for only half of the variation (R² = 0.5; S1 Fig).
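The share of variation explained by genome size can be illustrated with a simple least-squares fit. The following is a minimal sketch, not the analysis used in this study: the function name `r_squared` and the toy values are hypothetical, and the sketch does not assume any particular transformation (for example, log scaling) of the real data in S1 Table.

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares line y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain some variation")
    # For a single predictor, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values for illustration only (NOT taken from S1 Table).
genome_size_mbp = [500, 1200, 4000, 10000, 25000]
ecrt_counts = [3, 40, 25, 900, 1500]
print(round(r_squared(genome_size_mbp, ecrt_counts), 2))
```

A value near 0.5 from such a fit would mean that roughly half of the variance in copy number is left unexplained by genome size, which is the point made in the text.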

Fig 1. Distribution of aa-ECRTs, aa-ECRTs rep, and OTUs across plant genomes.

The cladogram on the left illustrates the phylogenetic relationships among the 73 plant species analyzed. Species previously examined for ECV detection are highlighted in red [14–17,19,21], while species investigated for the first time are highlighted in blue. Major plant lineages are annotated, including bryophytes, lycophytes, ferns, gymnosperms, and angiosperms, with the latter subdivided into ANA grade, Chloranthales, and Magnoliids. Taxonomic groups are labeled at the family level (lycophytes, gymnosperms and angiosperms) or order level (ferns). For each plant species, the following metrics are presented: species label, total number of detected aa-ECRTs, number of representative aa-ECRTs (aa-ECRTs rep), number of reference OTUs detected (OTU ref), number of newly identified OTUs (New OTU), genome size (in Mbp), and N50 contig value (in kbp), as computed by QUAST [23]. One aa-ECRT, which was detected in the genome of Anthoceros agrestis, was confirmed as a transposable element in a subsequent analysis and is therefore indicated with an asterisk (*). Silhouettes are sourced from the PhyloPic image repository (https://www.phylopic.org/).

https://doi.org/10.1371/journal.ppat.1014340.g001

Based on the 80% minimum amino acid (aa) identity threshold used by Caulifinder branch B [20], these sequences were grouped by host species into 1,684 clusters. The longest sequence from each cluster was selected as the representative aa-ECRT (aa-ECRT rep, Fig 1). The resulting set of 1,684 aa-ECRT rep was subsequently filtered to select the most divergent sequences. For this, we performed iterations of sequence clustering, alignment with 98 aa-RT reference caulimovirid sequences, and manual curation, resulting in the selection of a subset of 261 aa-ECRT rep. We also searched publicly available transcriptomic datasets (from NCBI GenBank [24], the 1KP project [25], and datasets from two transcriptomic studies dedicated to ferns and lycophytes, respectively [26,27]) for aa-ECRTs. Using the filtering process described above yielded two additional aa-ECRTs, from the transcriptomes of the ferns Brainea insignis (GFUE01038349.1) and Pteris vittata (GGXV01034871.1), bringing the total number of aa-ECRT rep in the subset to 263.
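The idea of grouping sequences at an identity threshold and keeping the longest member as the representative can be sketched as follows. This is an illustrative greedy procedure under stated assumptions, not the Caulifinder implementation: the sequences are assumed to be pre-aligned to equal length with `-` as the gap character, and the names `identity`, `greedy_cluster`, and the toy sequences are hypothetical.

```python
def identity(a, b):
    """Percent identity over alignment columns where both sequences have a residue."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def greedy_cluster(seqs, threshold=80.0):
    """Cluster aligned sequences; returns {representative_name: [member_names]}.

    Sequences are visited from longest to shortest (ungapped length), so the
    founder of each cluster is also its longest member.
    """
    order = sorted(seqs, key=lambda name: (-len(seqs[name].replace("-", "")), name))
    clusters = {}
    for name in order:
        for rep in clusters:
            if identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            # No existing representative is similar enough: found a new cluster.
            clusters[name] = [name]
    return clusters


# Toy aligned amino acid sequences (hypothetical).
toy = {"s1": "MKLVDAYTGF", "s2": "MKLVDAYT--", "s3": "MRIIEGWSGF"}
print(greedy_cluster(toy))
```

In this toy case the truncated sequence `s2` joins the cluster founded by `s1`, while the divergent `s3` founds its own cluster.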

Using an amino acid similarity threshold of 62%, allowing for clear separation of viral genera recognized by the ICTV into distinct operational taxonomic units (OTUs), the 263 aa-ECRT rep sequences combined with the 98 caulimovirid aa-RT reference sequences and eight outgroup sequences (from Metaviridae and Retroviridae) were grouped into 71 genus-level OTUs (S2 Table). Thirty-six OTUs included at least one aa-RT reference caulimovirid sequence. They were named either after their corresponding ICTV-recognized genus [6] (e.g., OTU Soymovirus) or after previously defined OTUs [14,16–19,28] (e.g., OTU Florendovirus), and are referred to as “reference OTUs”.

As expected, some of the 36 reference OTUs corresponded to subdivisions of OTUs defined in previous works that used lower sequence similarity thresholds [16,19] (S2 Table). For instance, OTU Gymnendovirus 2 [16] was split into three distinct OTUs, two of which included β-type gymnosperm endogenous caulimovirus-like viruses (β-GECV) [17]. The remaining 35 OTUs lacked reference sequences and were considered novel. These mainly consisted of sequences retrieved from plant genomes or transcriptomes that had not previously been screened for aa-ECRTs. Gymnosperms proved a particularly rich source of novel OTUs, especially Wollemia nobilis (Araucariaceae), which harbored five (Fig 1).

Phylogenetic analyses

To determine the phylogenetic placement of the 35 novel OTUs, we performed a phylogenetic analysis using nucleotide sequences encoding the RT-RH1 domain (ntRT-RH1), the most conserved region within caulimovirid genomes [6]. Such sequences were extracted from the 55 tracheophyte genomes in the dataset that contain ECVs (S3 Table) using the Caulifinder pipeline branch A [20], which reconstructs caulimovirid consensus sequences by aligning at least five ECV copies from the same host that share at least 85% nt identity. After filtering out LTR retrotransposons, 9,031 ECV consensus sequences were recovered from 49 plant species (three lycophytes, six ferns, and all gymnosperms (27), basal angiosperms (3), Chloranthaceae (1), and Magnoliids (9)) (S3 Table).
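The principle of deriving a consensus from several similar copies can be illustrated with a column-wise majority rule. This is only a sketch under assumptions: the copies are taken to be already aligned and already selected at the stated identity threshold, the function name `majority_consensus` is hypothetical, and the actual consensus procedure of Caulifinder branch A is not reproduced here.

```python
from collections import Counter


def majority_consensus(aligned_copies, min_copies=5):
    """Column-wise majority consensus of aligned nucleotide copies.

    Columns whose most frequent symbol is a gap ('-') are dropped.
    """
    if len(aligned_copies) < min_copies:
        raise ValueError(f"need at least {min_copies} copies")
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("copies must be aligned to equal length")
    consensus = []
    for column in zip(*aligned_copies):
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != "-":
            consensus.append(symbol)
    return "".join(consensus)


# Five toy copies (hypothetical), each differing from the others at one site at most.
copies = ["ATGCA", "ATGCA", "ATGTA", "AT-CA", "ACGCA"]
print(majority_consensus(copies))
```

Because each mutation in the toy copies is private to a single copy, the majority rule recovers the shared underlying sequence.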

Each ECV consensus sequence containing an ntRT-RH1 sequence (n = 6,503) was assigned to one of the 71 previously defined OTUs, and 47 ECV consensus sequences representing 22 novel OTUs were selected. Among these, 17 OTUs were represented by two to four consensus sequences (S2 Fig). For 17 of the 18 remaining novel OTUs, where either none or a single consensus sequence was available, ntRT-RH1 sequences were selected from 21 genomic ECV copies, and from one transcript from Brainea insignis (GFUE01038349.1) (S2 Fig). In total, 69 ntRT-RH1 sequences representing 34 of the 35 novel OTUs were obtained. The only exception was OTU 34, for which only highly mutated ntRT-RH1 sequences were recovered, precluding the selection of a representative sequence (S4 Table). The dataset was supplemented with 73 ntRT-RH1 sequences from reference OTUs, including 40 ntRT-RH1 sequences from representative members and 33 ntRT-RH1 sequences identified in this study from eight angiosperm genomes (S4 and S5 Tables). For three of the 36 reference OTUs, no representative ntRT-RH1 sequences could be recovered. Altogether, 142 ntRT-RH1 sequences representing 67 OTUs were compiled for phylogenetic analysis (S4 Table).

Two phylogenetic trees were constructed using Maximum Likelihood (ML) and Bayesian methods (S3 Fig), both yielding highly similar topologies. We selected the Bayesian tree for further analysis because it provided stronger support at basal nodes (posterior probability ≥ 0.95). Rooted with Saccharomyces cerevisiae Ty3 virus (Metaviridae exemplar virus), the tree was simplified by collapsing branches containing a single OTU (Fig 2). The 142 caulimovirid ntRT-RH1 sequences grouped according to their respective OTUs, except for the OTU ‘Wendovirus + Xendovirus gp S. tuberosum’ which appeared polyphyletic and the single sequence from OTU 30, which was nested within OTU 1 (Figs 2 and S3).
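Whether the sequences of an OTU group together on a rooted tree amounts to a monophyly test: the OTU is monophyletic if some clade contains exactly its tips and no others. The following minimal sketch illustrates this with a tree written as nested tuples; the tree, the tip labels, and the function names are hypothetical and are not drawn from Fig 2.

```python
def clade_tip_sets(tree):
    """Return (tips under this node, tip sets of every clade in this subtree)."""
    if isinstance(tree, str):
        tips = frozenset([tree])
        return tips, [tips]
    tips = frozenset()
    all_sets = []
    for child in tree:
        child_tips, child_sets = clade_tip_sets(child)
        tips |= child_tips
        all_sets.extend(child_sets)
    all_sets.append(tips)
    return tips, all_sets


def is_monophyletic(tree, members):
    """True if some clade of the rooted tree contains exactly the given tips."""
    _tips, sets = clade_tip_sets(tree)
    return frozenset(members) in sets


# Hypothetical rooted tree: tips a1/a2 form a clade, whereas c1 is nested
# among the b tips, so {b1, b2} alone is not a clade.
toy_tree = ("outgroup", (("a1", "a2"), (("b1", "c1"), "b2")))
print(is_monophyletic(toy_tree, {"a1", "a2"}))
print(is_monophyletic(toy_tree, {"b1", "b2"}))
```

The second case mirrors the situation in which a sequence from one OTU is nested within another OTU, so that the enclosing OTU cannot be collapsed into a single triangle without including the nested sequence.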

Fig 2. Bayesian phylogeny of Caulimoviridae, with collapsed OTU branches.

The phylogenetic tree comprises 142 nucleotide sequences corresponding to the RT-RH1 domain of Caulimoviridae. Bayesian inference was conducted using MrBayes [29–31], and the tree is rooted with a Metaviridae Ty3 sequence (highlighted in red). Red circles indicate nodes with posterior probability supports ≥ 0.95. Triangles represent collapsed monophyletic groups corresponding to individual OTUs. Polyphyletic OTUs are shown in multiple locations; “Wendovirus+Xendovirus gp S. tuberosum” appears twice. OTUs newly identified in this study are highlighted in blue, while previously defined (reference) OTUs are shown in black. The tree is divided into three major clades, color-coded as follows: Clade A (orange), Clade B (green), and Clade C (blue).

https://doi.org/10.1371/journal.ppat.1014340.g002

The tree is divided into three main clades, A, B, and C, with Clades A and B previously defined by Diop et al. (2018) [16] and Clade B subdivided into two subclades: B.1 and B.2. Seven of the 34 new OTUs (Table 1) grouped within Clade A. Among these, OTU 21 was sister to all Clade A members, while the other six were closer to either previously described ECVs or ICTV-recognized genera. In Clade B, 26 novel OTUs (Table 1) were distributed across the topology; half of them were closely related to known ECVs while twelve formed two new groups, B.2.1 and B.2.2. In addition, OTU 23 was grouped with Petuvirus, the only ICTV-recognized genus in Clade B. Subclade B.2 was further divided into groups of closely related OTUs to better describe its extended diversity (Fig 2). Remarkably, our phylogenetic analysis placed OTU 19 as a sister to Clades A and B, leading to its provisional assignment to a new Clade C.

Table 1. Distribution of new and reference OTUs across Caulimoviridae clades. The table presents the distribution of newly identified and previously defined reference OTUs across the three major clades of the Caulimoviridae family: Clade A, Clade B, and Clade C. The classification is based on the composition of the caulimovirid OTUs obtained using a 62% identity threshold for RT clustering (S2 Table).

https://doi.org/10.1371/journal.ppat.1014340.t001

Characterization of caulimovirid Clade C

In both unrooted ML and Bayesian trees (S3 Fig), OTU 19 formed a polytomy with the outgroup (Ty3) and Caulimoviridae Clades A and B, leaving its placement unresolved. To clarify its phylogenetic position within Ortervirales, we performed an ML phylogenetic analysis using RT domain amino acid sequences (aa-RT) from OTU 19, alongside representatives of Caulimoviridae Clades A and B and of the four other Ortervirales families (S6 Table), following the RT-based framework of Krupovic et al. (2018) [7].

The resulting tree resolved the polytomy (Fig 3a), placing OTU 19 as a sister clade to Clades A and B. OTU 19 includes sequences from Wollemia nobilis and Araucaria angustifolia, two critically endangered Araucariaceae species [34,35]. The 30K MP domain is shared among many viruses, prompting us to analyze the phylogenetic relationships of the MP domains from OTU 19 sequences alongside those from other caulimovirid clades and 14 plant virus families [32]. While MPs from many families were polyphyletic, Caulimoviridae MPs formed a monophyletic group with OTU 19 nested within (Fig 3b).

Fig 3. Taxonomic placement of Clade C and genome structure of Wollendoviruses.

(a) ML phylogeny of reverse transcriptase domains from 52 Ortervirales sequences. Red dots mark branches with bootstrap support ≥ 50%. Clades are collapsed for Caulimoviridae and by family for other Ortervirales members. (b) ML phylogeny of the 30K movement protein (MP) family, including 46 Caulimoviridae sequences identified using Caulifinder and 286 sequences from Butkovic et al. (2024) [32], representing the following plant viral families: Alphaflexiviridae, Aspiviridae, Betaflexiviridae, Bromoviridae, Botourmiaviridae, Caulimoviridae, Fimoviridae, Geminiviridae, Kitaviridae, Mayoviridae, Phenuiviridae, Rhabdoviridae, Secoviridae, Tospoviridae, and Virgaviridae. The tree is midpoint-rooted. For Caulimoviridae sequences, OTU IDs are listed, followed by sequence names; NCBI accession numbers are provided for reference sequences. OTU 19 is highlighted in green. Red dots indicate branches supported by bootstrap values ≥ 80%. For non-caulimovirid families, branches are collapsed by family and represented as triangles. (c) Genome organisation of the proposed Wollendovirus type member, WolV1, reconstructed from sequence assembly. Open reading frames (ORFs), predicted using ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/), are shown as light grey rectangles. The ORF of unknown function is shaded. The black diamond indicates the tRNAMet primer binding site, corresponding to genome position +1. Conserved domains identified via CD-search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi) against the CDD database [33] are color-coded: viral movement protein (MP; PF01107) in blue; retropepsin (AP; CD00303) in red; reverse transcriptase (RT; CD01647) in orange; RNase H1 (RH1; CD06222) in yellow; zinc finger in green.

https://doi.org/10.1371/journal.ppat.1014340.g003

We assembled a representative genome from W. nobilis ECVs, and the corresponding virus was tentatively named Wollendovirus 1 (WolV1; Fig 3c). The reconstructed WolV1 genome is 9,631 bp long and contains a 16 bp tRNAMet (TGGTATCAGAGCCAGG) primer binding site, unique in its genomic position but similar in sequence to other caulimovirids [6]. WolV1 contains two putative ORFs encoding, in order, a large polyprotein (2,107 aa, 243 kDa) featuring canonical caulimovirid domains (MP, CP, AP, RT and RH1), and a 568 aa (67 kDa) protein of unknown function with no homolog in GenBank.

The substantial structural and organisational similarity between the WolV1 genome and those of known caulimovirids suggests a synapomorphic relationship. Combined with the close phylogenetic relationship between the MP and RT domains of OTU 19 and those of other caulimovirids, this evidence supports the classification of Clade C as a distinct new lineage within the Caulimoviridae family.

Distribution of ECVs across plant genome assemblies

To assess the distribution of ECVs across plant taxa, we screened 93 land plant genome assemblies (S5 and S7 Tables) for endogenous caulimovirid RT nucleotide sequences (nt-ECRTs). This approach enabled the inclusion of sequences with mutations that disrupt open reading frames. From this analysis, we identified 47,135 nt-ECRTs across 75 tracheophyte genomes (S8 Table). Copy numbers consistently exceeded those of aa-ECRTs, averaging approximately threefold and peaking at 18.7-fold in Metasequoia glyptostroboides. nt-ECRTs were not observed in any genomes that lacked aa-ECRTs. The abundance of nt-ECRTs was positively correlated with host genome size (r² = 0.45, p = 9e-13). This trend held within lycophytes (r² = 0.85, p = 4e-04), gymnosperms (r² = 0.46, p = 2.8e-04), and angiosperms (r² = 0.43, p = 0.008) after multiple testing correction [36], but was weaker in ferns (r² = 0.24, p = 0.03) (S4 Fig).

All nt-ECRTs were assigned to OTUs using BLASTx against representative aa-RTs with a stringent 65% amino acid identity threshold, which we chose slightly above the OTU demarcation threshold (62%) to ensure unambiguous assignment (Fig 4 and S9 Table). All 47,135 nt-ECRTs were assigned to an OTU: 32,303 were assigned to 33 reference OTUs and 14,832 to 34 novel OTUs. Distribution varied widely between OTUs in both sequence abundance and host plant diversity. For example, Capsicum annuum contained the greatest number of nt-ECRTs linked to a single OTU, with 1,945 nt-ECRTs assigned to Solendovirus (Fig 4 and S9 Table). Florendoviruses exhibited the broadest host range, with 6,982 nt-ECRTs identified across 28 angiosperm species. In contrast, OTU 21 was represented by only two nt-ECRTs, in ANA-grade species Nymphaea colorata and Nymphaea thermarum.
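Threshold-based assignment of this kind can be sketched from standard BLAST tabular output (output format 6, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The sketch below is an assumption-laden illustration rather than the procedure used here: ranking hits by bit score, the function name `assign_otus`, the mapping `rep_to_otu`, and all identifiers in the toy data are hypothetical.

```python
def assign_otus(blast_lines, rep_to_otu, min_identity=65.0):
    """Assign each query to the OTU of its best hit at or above min_identity.

    blast_lines: iterable of tab-separated BLAST outfmt 6 lines (12 default columns).
    rep_to_otu:  dict mapping representative sequence IDs to OTU names.
    Queries without any qualifying hit are left out of the result.
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 12:
            raise ValueError("expected 12 tab-separated columns")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if pident < min_identity:
            continue
        # Keep the qualifying hit with the highest bit score (an assumed criterion).
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: rep_to_otu[subject] for query, (subject, _score) in best.items()}


# Toy records (hypothetical IDs and scores).
rows = [
    ["q1", "repA", "91.0", "200", "18", "0", "1", "600", "1", "200", "1e-90", "350"],
    ["q1", "repB", "70.0", "200", "60", "0", "1", "600", "1", "200", "1e-50", "210"],
    ["q2", "repB", "60.0", "200", "80", "0", "1", "600", "1", "200", "1e-30", "150"],
]
lines = ["\t".join(row) for row in rows]
print(assign_otus(lines, {"repA": "OTU_X", "repB": "OTU_Y"}))
```

In the toy data, `q1` is assigned to the OTU of its stronger hit, while `q2` remains unassigned because its only hit falls below the 65% identity cutoff.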

Fig 4. Heatmap of nt-ECRT abundance by OTU across plant hosts.

The heatmap displays the distribution of nt-ECRTs across plant genomes, grouped by host species (left) and viral OTUs (top right). Left: the host plant phylogeny is adapted from the TimeTree [37] database (accessed on 20/09/24). Geological periods are shown along the top, with red bars indicating the standard deviation of estimated emergence times for major plant lineages, based on the following sources: Lycophytes (432.5–392.8 MYA; Morris et al., 2018 [38]), Ferns (411.5–384 MYA; Morris et al., 2018 [38]; Lehtonen et al., 2017 [39]), Gymnosperms (380–360 MYA; Rothwell et al., 1989 [40]; Stull et al., 2021 [41]), and Angiosperms (246.5–130 MYA; Herendeen et al., 2017 [42]). Top: the Caulimoviridae cladogram is derived from the Bayesian tree of 142 RT-RH1 nucleotide sequences (see Fig 2), with major clades color-coded as follows: Clade A (orange), Clade B (green), and Clade C (blue). For visualization purposes, the polyphyletic OTU ‘Wendovirus+Xendovirus gp S. tuberosum’ (Fig 2) is represented here as a single, monophyletic clade. For clarity some OTU names have been shortened: Xendovirus gp G. raimondii to Xendovirus Grai; Unclassified gp V. vinifera+Yendovirus+Orendovirus to Unclassified Vvin; Dioscovirus+Badnavirus D. alata to Dioscovirus; Wendovirus+Xendovirus gp S. tuberosum to Wendo-Xendovirus stub; Gymnendovirus 2 gp P. glauca+PtaeV 2 + Pinus nigra virus to Gymnendovirus 2 gp P. glauca; Gymnendovirus 2 gp G. biloba 2 + GbilV to Gymnendovirus 2 gp G. biloba 2; Gymnendovirus 1 + PglaV2 to Gymnendovirus1. Color intensity in the heatmap corresponds to the number of nt-ECRT copies assigned to each OTU in a given host genome.

https://doi.org/10.1371/journal.ppat.1014340.g004

Strong lineage-specific patterns were observed across the three caulimovirid clades. In Clade A, nt-ECRTs were predominantly detected in angiosperm genomes (Fig 4) but with four notable exceptions: (i) OTU 5 was restricted to Gnetaceae (Gnetum gnemon and Gnetum montanum); (ii) OTU 15 was detected solely in five of seven Cupressaceae genomes (Sequoia sempervirens, Sequoiadendron giganteum, Cryptomeria japonica, Juniperus communis and Cupressus sempervirens); (iii) Wendovirus 4-related nt-ECRTs were found in both angiosperms (Lindenbergia philippensis) and ferns (Azolla pinnata and Rumohra adiantiformis); and (iv) OTU 10 was restricted to the fern Rumohra adiantiformis, with transcriptomic evidence suggesting its presence in the ferns Brainea insignis and Pteris vittata. Furthermore, OTU 21, which occupied a basal position within Clade A, was limited to the basal angiosperm family Nymphaeaceae. In Clade B, nt-ECRTs were primarily restricted to lycophyte, fern, and gymnosperm genomes, with two notable exceptions: florendoviruses and petuviruses were restricted to angiosperms and frequently had high copy numbers. Additionally, OTU 1 (subclade B.1) exhibited a broad host range, encompassing gymnosperms and basal angiosperms (Nymphaeaceae, Amborellaceae). In contrast, its sister lineage, OTU 30, was restricted to magnoliids. Finally, Clade C (OTU 19) nt-ECRTs were restricted in host range, with detections limited to Wollemia nobilis, Araucaria angustifolia, and Agathis dammara, all from the Araucariaceae.

Identification of caulimovirid sequences in plant transcriptomes

To further investigate caulimovirid-host interactions in basal plant lineages, we used Caulifinder branch B [20] on the same transcriptomic dataset as above [24–27]. This approach is more inclusive as it does not discriminate between transcripts from endogenous and episomal viral DNA.

We identified 529 transcripts encoding caulimovirid RT domains, 491 of which originated from angiosperms (S10 Table). For taxonomic assignment, we selected 194 transcripts covering ≥ 80% of the RT domain. Notably, 13 transcripts were from lineages absent in genome datasets: eight from ferns, two from gymnosperms, and three from ANA-grade angiosperms (Table 2 and S5 Fig). Fern sequences originated from the orders Cyatheales, Polypodiales, Marattiales, and Ophioglossales, the latter two representing basal lineages [43]. One of these sequences was assigned to the OTU Fernendovirus 1 P. formosana and two to the OTU Fernendovirus 1 C. protusa, within the ‘B.2.Fernendovirus 1’ group (Fig 2 and Table 2). The remaining five fern sequences lacked sufficient identity (below 62%) for OTU assignment using tBLASTn, but two, from Botrypus virginianus (Ophioglossales) and Lindsaea linearis (Polypodiales), placed within the ‘B.2.Fernendovirus 1’ group (S6 Fig). The two gymnosperm transcripts, from families Zamiaceae (Cycadales) and Podocarpaceae (Araucariales), were assigned to OTUs 1 and 3, respectively. For ANA-grade angiosperms, three transcripts from Austrobaileya scandens (Austrobaileyales) were classified as Florendovirus, two clustering near OsatBV in the ‘B.2 Florendovirus’ group (S7 Fig) [14]. Additionally, five transcripts from Lycopodiaceae, an underrepresented family in genome datasets, were analyzed. Two were assigned to OTU Fernendovirus 1 P. formosana and one to OTU 35 within the ‘B.2.Fernendovirus 1’ group, while two clustered with Badnavirus sequences from angiosperms.
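Selecting transcripts by the fraction of the RT domain they cover can be illustrated by merging alignment intervals along the domain. This is a sketch only: how coverage was actually computed is not specified here, and the function name `domain_coverage`, the domain length of 200 residues, and the toy intervals are hypothetical.

```python
def domain_coverage(hit_intervals, domain_length):
    """Fraction of domain positions 1..domain_length covered by hit intervals.

    hit_intervals: iterable of (start, end) pairs in domain coordinates,
    1-based and inclusive; overlapping intervals are counted once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(hit_intervals):
        start = max(start, current_end + 1, 1)
        end = min(end, domain_length)
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / domain_length


# Toy transcripts (hypothetical) aligned to a domain of assumed length 200.
toy_hits = {
    "transcript_1": [(10, 180)],
    "transcript_2": [(1, 100), (90, 150)],
}
kept = [name for name, hits in toy_hits.items() if domain_coverage(hits, 200) >= 0.8]
print(kept)
```

Here `transcript_1` covers 171 of 200 positions and passes an 80% cutoff, whereas `transcript_2` covers 150 positions once its overlapping intervals are merged and is excluded.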

Table 2. nt-RTs detected in the three transcriptome sources. This table presents details of the 18 transcripts retained from the search for caulimovirid RTs in plant transcriptomes. Each entry includes the following information: (i) Taxonomic information of the plant: division (column 1), species (column 2), genus (column 3), family (column 4); (ii) BLAST analysis results against the 361 aa-RT sequences used to define OTUs: OTU with the best hit (column 5), identity percentage of the best hit (column 6), transcript length in bp (column 7); (iii) Reference information: source publication (column 8), transcript accession number (column 9).

https://doi.org/10.1371/journal.ppat.1014340.t002

Evidence for patterns of cospeciation

To investigate the macroevolution of Caulimoviridae, we integrated host-virus associations from genomic and transcriptomic data, as well as from episomal caulimovirids. Fig 5 illustrates a condensed interactome linking viral clades or subclades to plant families or orders. In combination with the distribution data (Fig 4), this synthesis enabled us to investigate the patterns of caulimovirid-host coevolution at a macroscopic scale. We found that caulimovirids and their hosts do not exhibit overall parallel phylogenies: some plant divisions harbor several distinct caulimovirid lineages, as illustrated by Clades A, B, and C all being associated with gymnosperm hosts (Figs 4 and 5). However, we observed several host-caulimovirid distribution subpatterns that align with the coexistence of multiple deeply rooted caulimovirid lineages (Fig 6), outlined in the following points (i) through (v).

Fig 5. Distribution of Caulimoviridae clades and subclades across plant taxa.

Left: a plant cladogram depicts the major plant divisions of embryophytes for which either genomic or transcriptomic data were analyzed. Lycophytes and gymnosperms are resolved to the family level due to broader data availability. Ferns, with more limited genomic representation, are shown at the order level. Angiosperms are split into basal ANA grade and Mesangiospermae, which includes Magnoliids, Chloranthales, monocots, and dicots. Top: a viral cladogram summarizes the major Caulimoviridae clades and subclades as defined in Fig 2. Intersection circles indicate the detection of Caulimoviridae sequences within each host taxon, with the following color code. Blue: detection of episomal viral forms, including Petunia vein clearing viruses (B.2 Petuvirus [6]), Welwitschia mirabilis viruses 1 and 2 [18], cycad leaf necrosis virus (Clade A, genus Badnavirus [44]), and additional ICTV-recognized Caulimoviridae [6]. Red: detection of ECVs in host genomes. Green: detection of caulimovirid transcripts in transcriptomic datasets. Circles combining multiple colors indicate multiple forms of evidence (e.g., genomic + transcriptomic). Silhouettes are sourced from the PhyloPic image repository (https://www.phylopic.org/).

https://doi.org/10.1371/journal.ppat.1014340.g005

Fig 6. Evolution of Caulimoviridae diversity over time in relation to host range.

The figure integrates viral diversity with plant evolutionary history along a geological timescale adapted from McLoughlin (2021) [45]. Timeline: key evolutionary milestones are indicated, including (i) the emergence of tracheophytes and euphyllophytes [38]; (ii) the divergence of major vascular plant lineages: lycophytes, ferns, gymnosperms and angiosperms [38,39,41,42,46]; (iii) the last common ancestor (LCA) of Gleicheniales-Cyatheales and Polypodiales-Salviniales [39]; and (iv) the LCA of Cupressaceae and Araucariaceae [41]. The known host range of each clade is summarized and matched to key plant divergence points on the geological timeline, illustrating the potential cospeciation and diversification of Caulimoviridae alongside their plant hosts. Bottom: A simplified Caulimoviridae cladogram (adapted from Fig 4) highlights the three main viral clades: A (orange), B (green), and C (blue). Silhouettes representing plant taxa are sourced from the PhyloPic image repository (https://www.phylopic.org/).

https://doi.org/10.1371/journal.ppat.1014340.g006

  1. (i). Clade C sequences, all grouping within OTU 19, were found exclusively in three extant Araucariaceae species - W. nobilis, A. angustifolia, and A. dammara (the latter being based on SRA data) - each representing one of the family’s three living genera (Fig 4). These species are distributed across remnants of Gondwana (Australasia, Oceania, South America, and South East Asia), which completed breaking apart ~100 MYA during the Cretaceous [47,48]. Molecular dating places their last common ancestor (LCA) at least 150 MYA [41].

The narrow host range of Clade C, which aligns with Araucariaceae biogeography, suggests that the LCA of extant Araucariaceae hosted episomal Clade C viruses that co-diverged within the family. This supports the idea that Clade C ECVs represent descendants of a Caulimoviridae lineage that originated no later than the mid-Jurassic and evolved alongside their slow-evolving hosts [34]. The aa-ECRTs of Clade C show high levels of pairwise identity, up to 81.3%, 82.9% and 84% between W. nobilis and A. angustifolia, A. angustifolia and A. dammara, and W. nobilis and A. dammara, respectively. Such conservation over long timescales may reflect the slow evolution of both the virus and its host, potentially due to strong selective constraints.

  1. (ii). Clade A sequences were predominantly found in angiosperms. Yet, six paraphyletic OTUs (Solendovirus, Xendovirus gp G. raimondii, Badnavirus, Caulimovirus, Unclassified gp V. vinifera, and OTU 21) also occurred in ANA-grade lineages (Fig 4). In addition, OTU 21, restricted to Nymphaeaceae, occupies a basal position in Clade A, suggesting that diversification of this clade predated or coincided with early angiosperm evolution, ~130 MYA [42,46].

The host range of some Clade A OTUs extends beyond angiosperms. For example, OTU 15 spanned five Cupressaceae species, including Cupressus sempervirens and Sequoia sempervirens, whose LCA dates to at least 170 MYA [41]. Two closely related OTUs (OTU 10 and Wendovirus 4) were detected in ferns of the orders Polypodiales (Brainea insignis, Pteris vittata, Rumohra adiantiformis) and Salviniales (Azolla pinnata), which diverged over 280 MYA, predating angiosperms [39]. In addition, OTU 5 ECVs were found in two Gnetaceae species (Fig 4), and Badnavirus transcripts were found in Lycopodiaceae (Fig 5).

  1. (iii). Subclade B.1 sequences were identified across a wide phylogenetic range of seed plant (spermatophyte) hosts (Fig 5), encompassing several Magnoliid species, two ANA-grade basal angiosperm clades, and multiple distantly related gymnosperm families: Araucariaceae, Gnetaceae, Pinaceae, Welwitschiaceae, and the basal lineages Cycadaceae (Cycas panzhihuaensis) and Zamiaceae (Dioon edule, based on transcriptomic evidence, Table 2). These taxa trace back to an LCA of spermatophytes estimated at ~360 MYA [41]. The widespread distribution of subclade B.1 across both early-diverging gymnosperms and basal angiosperms suggests long-term cospeciation with spermatophytes, without significant transmission into more recent angiosperms.

Remarkably, sequences from Welwitschia mirabilis virus 1 and 2 (WMV1 and WMV2), derived from RNA-seq data [18], clustered with subclade B.1. This clustering aligns with the work of Debat & Bejerman [18], who suggested the existence of episomal forms of both viruses, providing indirect evidence for ongoing subclade B.1 infections in gymnosperms. This suggests that subclade B.1 includes both endogenous and episomal forms that have persisted since the early evolution of spermatophytes.

  1. (iv). Subclade B.2 comprises three basally branching, paraphyletic OTU groups associated exclusively with gymnosperms, potentially representing ancient Caulimoviridae lineages (Fig 4): (i) the ‘B.2.Gymnendovirus 2 gp G. biloba 2’ group, found only in Ginkgo biloba; (ii) the ‘B.2.Gymnendovirus 2 gp G. biloba 1’ group, with a broader host range spanning Araucariaceae, Cycadaceae, Ginkgoaceae, Cupressaceae, Taxaceae and Podocarpaceae (Podocarpus rubens, from transcriptomic data, Table 2); (iii) the B.2.1 group comprising four novel OTUs (9, 27, 28 and 29) identified in Ginkgo biloba and several Cupressaceae species. The distribution of these three B.2 groups across deeply divergent gymnosperm lineages, including “living fossils” like Ginkgo biloba, suggests they may be remnants of viral clades that arose early in gymnosperm evolution, possibly at the origin of seed plants ~360 MYA. Their persistence in distinct families may reflect lineage-specific retention and extinction of related viruses, or highly constrained co-evolution with slow-evolving host lineages.
  2. (v). In contrast to the groups in point (iv), the remaining members of subclade B.2 show a broader distribution across tracheophyte lineages (Fig 4), suggesting an older origin and complex diversification. The ‘B.2.Fernendovirus 2’ group is the most basal in subclade B.2, found in distantly related ferns Alsophila spinulosa (Cyatheales) and Dipteris conjugata (Gleicheniales), whose LCA dates back at least 350 MYA [39]. Since ‘B.2.Fernendovirus 2’ appears absent outside ferns, there is little support for a recent cross-division host switch, though host switches between ferns remain plausible. This basal group is sister to the rest of subclade B.2, which divides into two monophyletic groups.

The first monophyletic group includes ‘B.2.Gymnendovirus 1’, found in Cupressaceae, and ‘B.2.Fernendovirus 1’, present in various fern orders, including the basal Ophioglossales, and in lycophytes. The presence of ‘B.2.Fernendovirus 1’ in ferns and lycophytes — two ancient vascular plant lineages — suggests it may have originated near the base of vascular plants (~444 MYA [38]).

The second monophyletic group has two branches. One includes ‘B.2.Petuvirus’, found in ANA-grade angiosperms, magnoliids, and eudicots, and ‘B.2.2’, which spans divergent gymnosperm families, including Ginkgoaceae, Gnetaceae, Cupressaceae, and Araucariaceae, whose LCA dates back at least 350 MYA [41,45]. This broad host range suggests multiple ancient host jumps or a deep caulimovirid lineage that diversified alongside seed plants. The other branch includes three groups: ‘B.2.3’, found only in Araucariaceae, indicating a narrow and potentially ancient family-specific association; ‘B.2.Florendovirus’, widespread in basal angiosperms and mesangiosperms, consistent with extensive diversification within flowering plants; and ‘B.2.Gymnendovirus 3-4’, restricted to Pinaceae, suggesting a conifer-specific lineage.

Overall, the distribution of B.2 caulimovirids across ferns, lycophytes, gymnosperms, and angiosperms suggests a mosaic of ancient cospeciation, lineage-specific persistence, and probable host-switching, highlighting the deep and complex macroevolutionary history of this Caulimoviridae subclade.

Discussion

Our study substantially advances understanding of Caulimoviridae diversity by identifying 35 new genus-level OTUs. Of these, 34 enrich the known diversity within Clades A and B, and one defines a previously unrecognized lineage, Clade C. For context, the ICTV currently recognizes 11 genera in the Caulimoviridae family, based on episomal genome structure and phylogeny [6]. By greatly expanding the known genetic diversity of the family, our findings position the Caulimoviridae as a powerful model for exploring the long-term evolutionary dynamics of plant viruses.

We also show that Caulimoviridae have colonized all major divisions of tracheophytes, greatly broadening their recognized host range. Prior to this work, episomal and endogenous caulimovirids had been reported in 158 plant species from 68 distinct families and 59 orders [6,14–19,21,28] (S11 and S12 Tables). Here, we detect ECVs and/or ECV transcripts in 302 plant species, expanding the total known host range to 421 plant species across 135 families and 78 orders (S13 Table). The detection of caulimovirid-derived transcripts in basal ferns further suggests that some ECVs remain transcriptionally active. Although this does not imply the production of functional viral genomes, it indicates that these sequences are not universally silent and may retain biological relevance over long evolutionary timescales. Alternatively, these transcripts could derive from ongoing caulimovirid infection of the host samples.

Within this expanded diversity, Clade A OTUs are found primarily in angiosperms, but also in gymnosperms and ferns. Clade B OTUs span the entire tracheophyte spectrum, uniquely including lycophytes. Whereas only a single caulimovirid reverse transcriptase sequence had been previously identified from lycophytes, our detection of ECRTs across multiple lycophyte genomes and transcriptomes confirms that lycophytes are, or have been, genuine hosts of the Caulimoviridae. In contrast, Clade C appears narrowly restricted to the Araucariaceae, suggesting a lineage-specific association. Importantly, plant genome resources remain unevenly distributed across tracheophytes, particularly in lycophytes, ferns, gymnosperms, and basal angiosperms, which likely obscures additional lineage breadth. Although ECRT copy number is correlated with genome size, several species deviated markedly from the overall trend (S1 and S4 Figs). For example, the large gymnosperm genome of Metasequoia glyptostroboides contains only a small number of detectable aa-ECRTs, whereas other gymnosperms with comparable genome sizes harbor thousands. Such a difference underscores that the ECRT quantity likely reflects the combined effects of several factors, including infection severity, host genome dynamics, viral extinctions, and the evolution of host-specific antiviral defenses. In addition, ECRT copy number is likely underestimated in low-contiguity genome assemblies.

To investigate macroevolutionary patterns, we initially posited that (a) Caulimoviridae evolution was shaped by cospeciation with plant hosts, and (b) caulimovirids have relatively narrow host ranges, typically restricted to one or a few closely related families. The latter assumption is consistent with ICTV-recognized genera, which are typically limited to one or a few related plant families rather than spanning broadly across multiple families, except for Badnavirus, whose broad host range likely reflects historical host switching [6,44]. The distribution of ECVs from each OTU across plants (Fig 4) broadly supports a predominantly narrow host range, with notable exceptions such as Florendovirus, which spans all angiosperms.

Under (a) and (b), plant diversification would drive viral diversification, leading to largely parallel host and virus phylogenies, as also demonstrated at the scale of an island ecological community by French et al. [49], who showed that host phylogeny strongly shapes viral transmission networks and contributes to congruent host-virus evolutionary patterns. However, the distribution of Caulimoviridae lineages does not support strict cospeciation from a single ancestral lineage (Figs 4 and 5). Both Clades A and B occur in ferns, gymnosperms, and angiosperms, and several “orphan” lineages are found in basal gymnosperms. Instead, we identified several virus-host distribution sub-patterns consistent with the ancient origin of multiple caulimovirid lineages (Fig 6). We interpret association of closely related OTUs with closely related hosts (at the plant division level) as evidence of vertical transmission extending back to the host LCA. In contrast, associations spanning distantly related hosts likely reflect host switches or incomplete lineage sorting (ILS).

Within this framework, Clade A cospeciation can be traced back at least to early angiosperm evolution ~130 MYA [42,46]. Its sporadic presence in ferns and gymnosperms may represent a deeper origin with differential retention, multiple host switches, or ILS. Subclade B.1 can be traced back to the LCA of seed plants (at least 360 MYA [41]). Subclade B.2 is especially notable for having the broadest host range, spanning angiosperms, gymnosperms, ferns, and lycophytes. The ‘B.2 Fernendovirus 1’ lineage, which includes ECVs from both ferns and lycophytes (Fig 6), is particularly informative for reconstructing early virus-plant associations. Detection of B.2 ECVs in early-diverging ferns (e.g., Ophioglossales) supports infection of the euphyllophyte LCA. In contrast, their restriction in lycophytes to the Lycopodiaceae raises the possibility of either ancient vertical transmission or host switching. Thus, the most parsimonious hypothesis places the origin of subclade B.2 in the euphyllophyte LCA (~402 MYA [38]). Future expanded genomic resources for Isoetaceae and Selaginellaceae could help resolve this origin, although much of the historical lycophyte biodiversity is extinct [50,51]. Furthermore, while ECVs were absent from all seven bryophyte genomes under investigation, we cannot exclude that the Caulimoviridae host range could extend to the most basal lineages of land plants, and this absence may reflect an incapacity for viral integration or a systematic purge of ECVs in these lineages. Nevertheless, EVEs derived from dsDNA viruses of the phylum Nucleocytoviricota, which, like Caulimoviridae, lack an integrase gene and integrate passively, have been identified in the genome of the moss Physcomitrella patens, showing that viral dsDNA can, at least sometimes, be taken up as filler sequences during DNA repair [52]. Alternatively, bryophyte genomes could contain very ancient ECVs that have degraded to the point of being no longer identifiable, leaving open the possibility that Caulimoviridae may have become extinct long ago in this lineage.

Taken together, sub-patterns of cospeciation are consistent with parallel co-evolution of distinct caulimovirid lineages with their respective host lineages (Fig 6). Still, they are incomplete, likely reflecting historical host switches, ILS, and extinction events. Episomal caulimovirids are largely known from symptomatic plants, and ECVs only record viral lineages that successfully integrated into the germline, leaving an unquantifiable number of interactions undetectable. Recurrent antiviral defenses may have eliminated many lineages independently in different plant groups, while mass extinctions, particularly those at the Late Permian-Triassic (~250–200 MYA) and Cretaceous–Paleogene (~66 MYA) [39,45,50,53], further eroded signals of cospeciation (Fig 5).

Integrating viral host range with viral phylogeny, we propose four major caulimovirid lineages: Clade A, Clade C, and subclades B.1 and B.2 within Clade B (Fig 6). Our phylogenetic analysis (Fig 2) further suggests that diversification into Clades A, B, and C predates the emergence of subclade B.2, and likely occurred within the euphyllophyte LCA. This scenario supports multiple ancient speciation events coinciding with the Siluro-Devonian expansion of land plants, which may have created new ecological niches for early establishment of the Caulimoviridae across tracheophytes.

By expanding both sequence diversity and host range, our study firmly establishes the Caulimoviridae as a model system for investigating the origins, diversification, and extinction of plant viruses. As molecular fossils, ECVs uniquely enable the reconstruction of viral macroevolution over timescales spanning hundreds of millions of years, underscoring their crucial role in tracing the deep evolutionary history of plant-virus interactions. Moreover, the evolutionary history of Caulimoviridae is deeply intertwined with that of other plant virus families encoding 30K movement proteins (MPs), which are unique to plant viruses and essential for cell-to-cell movement through plasmodesmata. The 30K MP likely originated from the coat protein gene of a virus infecting an early vascular plant and subsequently spread across disparate viral lineages through horizontal transfer, a transformative event in the emergence of the modern plant virome [32]. Because many of these MP-encoding lineages predate the Caulimoviridae, our work provides a framework for situating the origin and radiation of the modern plant virome within the broader evolutionary history of land plants.

Methods

Plant genome datasets

To identify and characterize endogenous caulimovirid sequences (ECVs), we assembled three complementary datasets of plant genomic sequences. The first dataset targeted understudied plant groups to explore novel ECV diversity. It comprised 73 publicly available embryophyte genome assemblies (S1 Table). This dataset included seven bryophyte genomes and 66 genomes from basal tracheophytes: nine lycophytes (three Isoetaceae, three Selaginellaceae, and three Lycopodiaceae), 17 ferns (one Gleicheniales, two Cyatheales, five Salviniales, and nine Polypodiales), 27 gymnosperms (seven Cupressaceae, two Taxaceae, two Araucariaceae, eleven Pinaceae, two Gnetaceae, one Cycadaceae, one Welwitschiaceae, one Ginkgoaceae), three ANA-grade angiosperms (two Nymphaeaceae, one Amborellaceae), one Chloranthales, and nine Magnoliids (one Aristolochiaceae, two Magnoliaceae, two Calycanthaceae, four Lauraceae). The second dataset was designed to complement the first by including seven angiosperm genomes: Lindenbergia philippensis, Citrus sinensis, Rosa chinensis, Ricinus communis, Nicotiana sylvestris, Vitis vinifera, and Glycine max (S5 Table). A third dataset was compiled, including 14 additional angiosperm genomes representing major lineages: Dioscorea alata, Musa balbisiana, Brachypodium distachyon, Oryza sativa, Papaver somniferum, Nelumbo nucifera, Beta vulgaris, Vaccinium darrowii, Rudbeckia hirta, Solanum lycopersicum, Capsicum annuum, Arabidopsis lyrata, Pyrus communis, and Fragaria vesca, while excluding the Nicotiana sylvestris genome (S7 Table).
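The composition of the three datasets can be cross-checked with a short tally. The sketch below only re-adds the counts quoted in the paragraph above; the variable names are illustrative, and the way the three datasets combine into one total (adding the second and third sets to the first and counting Nicotiana sylvestris as removed) is an assumption made here to reproduce the 93 genomes given in the abstract.

```python
# Genome counts transcribed from the dataset description (first dataset).
first_dataset = {
    "bryophytes": 7,
    "lycophytes": 3 + 3 + 3,                        # Isoetaceae, Selaginellaceae, Lycopodiaceae
    "ferns": 1 + 2 + 5 + 9,                         # Gleicheniales, Cyatheales, Salviniales, Polypodiales
    "gymnosperms": 7 + 2 + 2 + 11 + 2 + 1 + 1 + 1,  # eight families as listed
    "ANA-grade angiosperms": 2 + 1,
    "Chloranthales": 1,
    "Magnoliids": 1 + 2 + 2 + 4,
}
second_dataset = 7          # additional angiosperm genomes
third_dataset_added = 14    # further angiosperm genomes
third_dataset_removed = 1   # Nicotiana sylvestris (assumed dropped from the total)

first_total = sum(first_dataset.values())
basal_tracheophytes = first_total - first_dataset["bryophytes"]
# Assumption: datasets are cumulative, so the overall total is a simple sum.
all_genomes = first_total + second_dataset + third_dataset_added - third_dataset_removed

print(first_total, basal_tracheophytes, all_genomes)
```

Under these assumptions the per-family counts add up to the 73 assemblies and 66 basal tracheophyte genomes stated in the text.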

Plant transcriptomic datasets

To identify ECVs and Caulimoviridae transcripts, we analyzed four publicly available transcriptomic datasets: (1) NCBI Transcriptome Shotgun Assemblies (TSA, https://www.ncbi.nlm.nih.gov/) [24], (2) the “1,000 Plants (1KP)” project by Leebens-Mack (2019; https://sites.google.com/a/ualberta.ca/onekp/) [25], (3) a curated set of lycophyte transcriptomes provided by Xia et al. (2022) [27], and (4) fern transcriptomes compiled by Ali et al. (2024; https://conekt.sbs.ntu.edu.sg/species/) [26].

Collection of reference Caulimoviridae amino acid RT sequences

A reference library of 106 amino acid reverse transcriptase (aa-RT) sequences homologous to the RT domain of cauliflower mosaic virus (CaMV; GenBank accession number V00141, positions 4,449-5,648) was assembled (S14 Table). It includes 98 aa-RT sequences from Caulimoviridae comprising:

  1. (i). 16 from episomal viruses, spanning all eleven ICTV-recognized genera [6]: Badnavirus (n = 4), Caulimovirus (n = 1), Cavemovirus (n = 1), Dioscovirus (n = 1), Petuvirus (n = 1), Rosadnavirus (n = 1), Ruflodivirus (n = 1), Solendovirus (n = 1), Soymovirus (n = 3), Tungrovirus (n = 1), Vaccinivirus (n = 1);
  2. (ii). 3 from putative caulimovirids: Welwitschia mirabilis virus 1 and 2 (WMV1, WMV2) [18] and Pinus nigra virus [28];
  3. (iii). 79 from ECVs, including: one from the proposed genus Orendovirus, represented by Aegilops tauschii virus [21]; 11 from the proposed genus Florendovirus [14]; 56 from the proposed genera Gymnendovirus 1–4 (n = 22), Xendovirus (n = 2), Yendovirus (n = 4), Zendovirus (n = 1), Petuvirus-like (n = 1), Fernendovirus 1–2 (n = 4); 19 from genera Badnavirus (n = 10), Caulimovirus (n = 1), Petuvirus (n = 7), Soymovirus (n = 1) [16]; five Beta Endogenous Caulimovirus-like Viruses (BECVs [17]); six from the proposed genus Wendovirus [19].

To root phylogenetic trees, eight outgroup sequences were included: seven from Metaviridae and Retroviridae, curated in the GyDB database [22], and one additional Metaviridae RT sequence from Anthoceros agrestis.

Caulifinder libraries

Caulifinder branch A [20] is distributed with two core libraries. The “Caulimoviridae_ref_genomes” library, comprising full-length reference caulimovirid genomic sequences, is used for the initial detection of ECVs. The “baits” library, containing ORFs and conserved protein domains from various members of the order Ortervirales, is used to filter out false positives [20,54]. Caulifinder branch B [20] is delivered with three distinct libraries. “Caulimo_RT_probes” contains caulimovirid aa-RT sequences used for the initial detection of ECRTs. “Tree_RT_set_OUTGP” includes aa-RT sequences from Caulimoviridae and other Ortervirales, and serves to filter out false positives. “Tree_RT_set” includes aa-RT sequences from Caulimoviridae and reference Ortervirales and is used to generate phylogenetic trees of the detected ECRTs [20,47].

To improve detection sensitivity and taxonomic resolution, the complete genome sequences of Gymnendovirus 5, 6, 7, and 8, Gnetovirus, and Sequoiavirus [55] (Serfraz, personal communication) were integrated into the Caulifinder branch A reference genome library. All translated protein sequences ≥ 100 aa in length were incorporated into the “baits” library. The corresponding aa-RT sequences were also added to both the “Caulimo_RT_probes” and “Tree_RT_set” libraries of branch B. To further enhance discrimination between bona fide viral sequences and unrelated elements, we supplemented the “baits” library with 545 RT and pol domain sequences from Metaviridae and Retroviridae, retrieved from the GyDB database [22]. These augmented resources are hereafter referred to as customized libraries.

Detection and Clustering of endogenous caulimovirid aa-RT domains (aa-ECRTs)

To identify amino acid sequences corresponding to endogenous caulimovirid RT domains (aa-ECRTs), Caulifinder branch B [20] was applied to 73 plant genome assemblies using customized RT libraries. The pipeline was also run on transcriptomic datasets from the 1KP project and curated lycophyte and fern transcriptomes. In parallel, a tBLASTn search was conducted against the NCBI TSA database using the same custom RT sequences. Due to the database size, this search was restricted to transcriptomic data from five major plant groups: Bryophytes (taxid:3208), Lycophytes (taxid:1521260), Ferns (taxid:241806), Gymnosperms (taxid:1437180), and Angiosperms (taxid:3398). Representative aa-ECRT sequences from these sources were combined with the reference library of 106 RT amino acid sequences.

Pairwise similarity comparisons were performed using BLASTp [56], and the results were used to cluster sequences with the Markov Cluster Algorithm (MCL) [57] at 62% identity (inflation parameter I = 2). This cut-off was selected to align with genus-level clustering, ensuring clusters approximate recognized caulimovirid genera while resolving distinct OTUs within them. Representative sequences for each cluster were selected based on: (i) a minimum length of 150 amino acids, (ii) presence of the conserved ‘DD’ catalytic motif (aligned to positions 145–146 of the CaMV RT domain) and (iii) good alignment quality.
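As an illustration of how pairwise BLASTp results can feed a Markov clustering step, the sketch below turns tabular BLAST output into weighted edges in the label-pair ("abc") format that MCL accepts, and applies the two automatable representative-selection criteria. It is a hedged sketch, not the authors' script: the function names are hypothetical, applying the 62% identity value as an edge filter is an assumption about how the threshold was used, and the 'DD' test is a crude stand-in for checking the motif at the aligned CaMV positions.

```python
import csv
import io


def blast_to_abc(blast_tsv, min_identity=62.0):
    """Convert BLASTp tabular output (query, subject, %identity, ...) to MCL edges.

    Assumption: pairs below `min_identity` are dropped before clustering.
    """
    edges = []
    for row in csv.reader(io.StringIO(blast_tsv), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, subject, pident = row[0], row[1], float(row[2])
        if query == subject:
            continue  # self-hits carry no clustering information
        if pident >= min_identity:
            edges.append((query, subject, pident))
    return edges


def passes_basic_filters(aa_seq, min_len=150):
    """Length and catalytic-motif check for a candidate representative.

    Simplification: looks for 'DD' anywhere, not at the aligned positions.
    """
    return len(aa_seq) >= min_len and "DD" in aa_seq


# The edge list could then be written as tab-separated lines and clustered
# with a typical MCL call such as:
#   mcl edges.abc --abc -I 2 -o clusters.txt
```

The third criterion in the text, alignment quality, is a judgment made on the alignment itself and is not modeled here.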

Phylogeny of the Caulimoviridae

The phylogenetic reconstruction of the Caulimoviridae family was based on nucleotide sequences encoding the RT-RH1 domain homologous to that of the cauliflower mosaic virus (CaMV, GenBank accession number V00141, positions 4,449-5,648), the type virus of genus Caulimovirus [6]. To investigate novel OTUs, homologous sequences were recovered from plant genome and transcriptome datasets. Representative sequences were compiled from three sources:

  1. (i). Plant genomes: the “sequence retriever” module of Caulifinder branch A [20] was run with default parameters on the original 73 embryophyte genomes (S1 Table) plus seven additional angiosperm genomes: Lindenbergia philippensis, Citrus sinensis, Rosa chinensis, Ricinus communis, Nicotiana sylvestris, Vitis vinifera, and Glycine max (S5 Table). Consensus sequences were generated when ≥ five ECV copies shared ≥ 85% nucleotide identity; only those spanning the RT-RH1 domain were retained. Sequences exhibiting < 41% BLASTp identity to reference aa-ECRT sequences were excluded, a threshold set to the highest similarity between a caulimovirid RT and that of the Metaviridae Ty3, ensuring conservative filtering. Sequences were manually selected by retaining consensus sequences with at least three copies covering the RT-RH1 domain.
  2. (ii). Genomic copies: when no or a single consensus could be obtained, unique ECRT loci were extracted using positional information from Caulifinder branch B [20] via the “marker miner” module. A 2 kbp flanking region was extracted on both ends to ensure complete domain recovery.
  3. (iii). Transcriptomic data: for Brainea insignis, for which no complete genome sequence was available, an RT-RH1 domain-containing sequence was retrieved from transcriptomic data (NCBI, accession: GFUE01038349.1).

Sequences were assigned to OTUs by BLASTx (e-value ≤ 1e-5) against the 361 aa-RT sequences used to define the OTUs, requiring ≥ 62% identity over ≥ 150 residues. Nucleotide sequences < 1,000 bp or producing long branches in preliminary phylogenetic trees were discarded. Up to four representatives were kept per OTU. Alignments used MAFFT v7.511 [58] with the G-INS-i strategy and 1,000 iterative refinements. Several alignment trimming strategies were evaluated; however, the untrimmed alignment yielded the most robust and accurate phylogenetic signal. Phylogenetic trees were subsequently inferred using two complementary methods:

  1. (i). Maximum Likelihood (ML) with IQ-TREE v2.1.2 [59], 1,000 ultrafast bootstrap replicates, best-fit substitution model (GTR + F + I + G4) via ModelFinder [60].
  2. (ii). Bayesian Inference with MrBayes v3.2.7 [29–31], under the GTR + I + Γ model using 2,000,000 generations, sampling every 500. Convergence was assessed using Tracer v1.7.2 [61], with 10% burn-in before consensus tree calculations.
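The Bayesian settings above fix how many posterior samples enter the consensus tree. The arithmetic can be sketched as follows, counting samples per run and ignoring any sample of the starting state; the function name is illustrative, and the number of independent runs and chains is not stated in the text, so it is left out.

```python
def mcmc_sample_counts(generations, sample_every, burnin_percent):
    """Return (total, discarded, kept) samples for one MCMC run.

    Integer arithmetic is used so the burn-in count is exact.
    """
    total = generations // sample_every
    discarded = total * burnin_percent // 100
    return total, discarded, total - discarded


total, discarded, kept = mcmc_sample_counts(2_000_000, 500, 10)
print(total, discarded, kept)
```

With 2,000,000 generations sampled every 500 and a 10% burn-in, this gives 4,000 samples per run, of which 400 are discarded and 3,600 are kept.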

Quantification and classification of nt-ECRTs across plant genomes

To quantify nt-ECRTs across our plant genome dataset, each nt-ECRT identified by Caulifinder branch B [20] was queried against the representative aa-ECRT library used to define OTUs, using BLASTx (e‑value ≤ 1 × 10⁻⁵). Assignments to OTUs were made based on each nt-ECRT’s top BLASTx hit, requiring ≥ 65% amino acid identity over ≥ 150 residues (the length of the smallest reference aa-RT domain). We adopted a 65% identity cutoff, more stringent than the 62% threshold used for clustering aa-RTs into OTUs, to minimize misclassification and improve specificity in quantifying nt-ECRT abundance across plant genomes.

Classification of RT sequences detected in transcriptomic data

The classification of nt-ECRTs recovered from transcriptomic datasets was conducted using BLASTx against the set of aa-RT sequences used to define OTUs.

Each transcript-derived nt-ECRT was assigned to an OTU based on its best BLASTx hit, provided the alignment met a minimum identity threshold of 62% across ≥ 150 amino acids, matching the criteria used during OTU clustering. This consistent framework allowed reliable classification of transcript-derived sequences within the established Caulimoviridae diversity landscape.
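The best-hit rule used for both genomic and transcript-derived nt-ECRTs can be sketched as a single function with the identity cutoff as a parameter (65% for genome quantification, 62% for transcriptomic classification). This is an illustrative sketch only: the `Hit` record and `assign_otu` function are hypothetical names, and ranking hits by bit score is an assumption about what "best BLASTx hit" means here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    otu: str          # OTU of the reference aa-RT sequence that was hit
    identity: float   # percent amino acid identity
    aln_len: int      # aligned length in residues
    bitscore: float   # used here to rank hits (assumption)


def assign_otu(hits: List[Hit], min_identity: float, min_len: int = 150) -> Optional[str]:
    """Assign an OTU from the top hit, or None if it fails the thresholds."""
    if not hits:
        return None
    best = max(hits, key=lambda h: h.bitscore)
    if best.identity >= min_identity and best.aln_len >= min_len:
        return best.otu
    return None


hits = [Hit("OTU 19", 63.5, 180, 250.0), Hit("Badnavirus", 70.0, 90, 120.0)]
print(assign_otu(hits, min_identity=62.0))  # transcriptomic threshold
print(assign_otu(hits, min_identity=65.0))  # genomic quantification threshold
```

In this toy case the same sequence is classified under the 62% rule but left unassigned under the stricter 65% rule, which is the trade-off between sensitivity and specificity described in the text.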

Viral genome assembly from ECVs

The WolV1 genome was assembled from ECVs identified in the W. nobilis genome, as described previously [14]. Briefly, the 5’ and 3’ genomic positions of nt-ECRTs assigned to OTU 19 were extended by 10 kbp, and the sequences of the corresponding loci were extracted. These sequences were compared using an all-against-all BLASTn search to identify high-scoring sequences from different loci that contained identical or near-identical sequences. Fragments of the virus sequence were assembled using VECTOR NTI Advance 10.3.1 (Invitrogen), operated with default settings, except that the maximum clearance values for error rate and maximum gap length were increased to 500 and 200, respectively.
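The locus-extension step can be illustrated with a small coordinate helper. The sketch assumes 1-based inclusive coordinates and clamps the extended window to the contig boundaries; both choices, and the function name, are assumptions made for illustration, since the text does not say how loci near contig ends were handled.

```python
def extend_locus(start, end, flank, contig_len):
    """Extend a 1-based inclusive interval by `flank` bp on each side.

    The window is clamped to [1, contig_len] (assumed behavior).
    """
    if start > end:
        raise ValueError("start must not exceed end")
    return max(1, start - flank), min(contig_len, end + flank)


# A 10 kbp extension, as used for the OTU 19 nt-ECRT loci.
print(extend_locus(5_000, 5_600, 10_000, 50_000))
print(extend_locus(30_000, 30_600, 10_000, 35_000))
```

The same helper with `flank=2_000` would correspond to the 2 kbp flanking regions used when extracting unique ECRT loci for the phylogeny.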

Phylogeny of Ortervirales

To investigate the evolutionary relationships within Ortervirales, a representative set of 52 amino acid sequences corresponding to the RT domains of viruses from its five recognized families (Belpaoviridae, Pseudoviridae, Retroviridae, Metaviridae, and Caulimoviridae) was compiled [7,22] (S6 Table).

Sequences were aligned using MAFFT v7.511 [58] with the G-INS-i global alignment strategy and up to 1,000 iterations. An ML phylogeny was reconstructed using IQ-TREE v1.6.12 [59], with 1,000 bootstrap replicates for branch support. The rtREV + G4 substitution model was selected automatically using the Bayesian Information Criterion (BIC) via ModelFinder [60].

Phylogeny of the 30K movement protein (MP) family

The phylogenetic analysis of the 30K movement protein (MP) family was conducted using amino acid sequences derived from Caulimoviridae and homologs from other plant virus families. MP domains from caulimovirid nucleotide sequences were initially identified by annotating protein domains in ECV loci, consensus sequences, and reference genomes using rpsblastn [56] with the CDD database (e‑value ≤ 1) [33]. Ambiguous matches were further validated against the Pfam database using HMMER3 [62,63]. A total of 46 sequences representing the diversity of Clades A, B, and C OTUs were selected and translated, ten of which belonged to Clade C. MP domains retrieved from plant genomic loci were classified following the OTU assignment of their most proximal ECRT within a range of 5 kbp upstream and downstream, following the positions encoding RT and MP domains in ICTV-characterized genomes [6].

To contextualize caulimovirid MP diversity within broader viral evolution, an additional 286 MP sequences were sourced from Butkovic et al. [32], spanning 14 plant virus families: Alphaflexiviridae, Aspiviridae, Betaflexiviridae, Bromoviridae, Botourmiaviridae, Fimoviridae, Geminiviridae, Kitaviridae, Mayoviridae, Phenuiviridae, Rhabdoviridae, Secoviridae, Tospoviridae, and Virgaviridae.

In total, 332 MP sequences were aligned using MAFFT v7.511 [58] with the G-INS-i global alignment strategy and up to 1,000 iterations. The alignment was trimmed with trimAl 1.4 [64] using the -gt 0.5 option to remove poorly aligned columns. ML phylogeny was inferred using IQ-TREE 2.1.4 [59], with 1,000 SH-aLRT and 1,000 ultrafast bootstrap replicates. The LG + F + G4 substitution model was selected based on BIC via ModelFinder [60]. The resulting tree was midpoint-rooted to facilitate the interpretation of clade relationships.
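The align, trim, and infer sequence described above maps onto three command lines, assembled below as argument lists without running anything. This is a hedged sketch: the exact commands the authors ran are not given, the file names are placeholders, and `iqtree2` is assumed to be the executable name for IQ-TREE 2. The flags shown are the standard ones for the G-INS-i strategy in MAFFT (`--globalpair --maxiterate`), the gap threshold in trimAl (`-gt`), and ModelFinder with SH-aLRT and ultrafast bootstrap in IQ-TREE 2 (`-m MFP --alrt -B`).

```python
def mafft_ginsi_cmd(fasta_in, max_iter=1000):
    """MAFFT G-INS-i; the alignment is written to standard output."""
    return ["mafft", "--globalpair", "--maxiterate", str(max_iter), fasta_in]


def trimal_cmd(aln_in, aln_out, gap_threshold=0.5):
    """trimAl with a gap threshold, as in the -gt 0.5 setting in the text."""
    return ["trimal", "-in", aln_in, "-out", aln_out, "-gt", str(gap_threshold)]


def iqtree_cmd(aln, replicates=1000):
    """IQ-TREE 2 with ModelFinder, SH-aLRT and ultrafast bootstrap support."""
    n = str(replicates)
    return ["iqtree2", "-s", aln, "-m", "MFP", "--alrt", n, "-B", n]


# Placeholder file names, for illustration only.
for cmd in (
    mafft_ginsi_cmd("mp_sequences.fasta"),
    trimal_cmd("mp_aligned.fasta", "mp_trimmed.fasta"),
    iqtree_cmd("mp_trimmed.fasta"),
):
    print(" ".join(cmd))
```

Building the commands as lists keeps them usable with `subprocess.run` without shell quoting; MAFFT's output would need to be redirected to the alignment file read by trimAl.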

Search for Clade C RTs in Agathis dammara raw sequencing data

To investigate the potential presence of OTU 19 (Clade C) RT sequences in the genus Agathis (Araucariaceae), we analyzed Illumina whole-genome sequencing data from Agathis dammara (SRR15616211), which were initially generated for a mitogenome sequencing project (NCBI bioproject 757934). The dataset comprises 3.4 Gbp, corresponding to approximately 0.12× coverage of the estimated 27 Gbp genome [65]. The OTU 19 representative aa-ECRT from W. nobilis was used as a query in tBLASTn searches via the NCBI web interface [24,56]. To maximize detection across the entire RT domain, the probe was segmented into consecutive 50 amino acid fragments, each querying the short-read dataset independently. The top ten hits per fragment were pooled and assembled using CAP3 [66], yielding a 529 nt contig, which was translated into a 175 amino acid RT sequence designated as the OTU 19 representative aa-ECRT from A. dammara and validated by BLASTx against the curated set of 361 reference RT sequences used in OTU classification.
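Two small calculations underlie this search: the sequencing depth implied by the read dataset, and the segmentation of the probe into 50-residue queries. The sketch below illustrates both. The function names are hypothetical, the probe is a placeholder string rather than the real OTU 19 sequence, and keeping a shorter final fragment is an assumption, since the text does not say how a remainder was handled.

```python
def sequencing_depth(read_bases, genome_size):
    """Average depth: sequenced bases divided by genome size."""
    return read_bases / genome_size


def split_probe(aa_seq, size=50):
    """Cut a protein probe into consecutive, non-overlapping fragments.

    The last fragment may be shorter than `size` (assumed behavior).
    """
    return [aa_seq[i:i + size] for i in range(0, len(aa_seq), size)]


depth = sequencing_depth(3.4e9, 27e9)
print(round(depth, 3))  # about 0.126, i.e. roughly one eighth of the genome size in raw bases

placeholder_probe = "A" * 175  # stand-in only; not the real probe sequence
print([len(f) for f in split_probe(placeholder_probe)])
```

At such shallow depth most single-copy loci are unsampled, so recovering an RT contig at all is more likely for sequences present in many copies, which is worth keeping in mind when interpreting a positive or negative result from this kind of search.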

Supporting information

S1 Fig. Relationship between the number of detected aa-ECRTs and host plant genome size (excluding Ns and contigs of length < 500 bp).

This figure displays data from 64 plant genomes, with species labels corresponding to those shown in Fig 2. For simplicity, Chloranthus sessilifolius is grouped within Magnoliidae. A linear regression analysis was conducted using R version 4.3.2 (R Core Team, 2021), resulting in a slope of 0.054, a p-value of 9 × 10⁻¹², and an R² of 0.5.

https://doi.org/10.1371/journal.ppat.1014340.s001

(TIF)

S2 Fig. Type and number of sequences used for phylogenetic analyses.

The y-axis indicates the number of sequences, while each bar is segmented by sequence type, as indicated by the color code on the right. Sequence types include reference sequences, consensus sequences, and transcriptome-derived sequences. OTUs are organized along the x-axis, grouped by clade or genus.

https://doi.org/10.1371/journal.ppat.1014340.s002

(TIF)

S3 Fig. Bayesian (a) and maximum likelihood (b) phylogenies of Caulimoviridae.

Both trees were generated from an alignment of 142 nucleotide sequences corresponding to the RT-RH1 domain of Caulimoviridae. The Ty3 sequence, shown in red, was used as an outgroup to construct both trees. Sequence labels are formatted as follows: OTU name | sequence name. ICTV reference sequences retain their original names, while other sequences are labeled as host_name copy/consensus ID_number. Newly identified OTUs in this study are shown in blue; reference OTUs are shown in black. a: The Bayesian phylogeny was inferred using MrBayes. Red circles on branches indicate posterior probabilities ≥ 0.95. b: The maximum likelihood phylogeny was inferred using IQ-TREE with the best-fitting nucleotide substitution model as determined by ModelFinder. Red circles denote branches with bootstrap support ≥ 70.

https://doi.org/10.1371/journal.ppat.1014340.s003

(TIF)

S4 Fig. Relationship between the number of nt-ECRTs and host plant genome size (excluding Ns and contigs of length < 500 bp).

(a) Number of nt-ECRTs plotted against genome size for 86 tracheophyte species. Linear regression performed in R 4.3.2 (R Core Team, 2021) yielded a slope of 0.08, p-value = 9 × 10⁻¹³, and R² = 0.45. Host plant tags are omitted in this panel. (b) Relationship for nine lycophyte genomes. The regression slope is 0.02, with a p-value = 4 × 10⁻⁴ (after correction for multiple testing using the Benjamini-Hochberg (BH) method), and R² = 0.85. All host plant tags are indicated. (c) Relationship for 17 fern genomes. The regression slope is 0.07, with a p-value = 0.03 (after correction for multiple testing with the BH method) and R² = 0.24. (d) Relationship for 27 gymnosperm genomes. The regression slope is 0.12, with a p-value = 2.8 × 10⁻⁴ (after correction for multiple testing with the BH method), and R² = 0.46. Host plant tags are partially shown. (e) Relationship for 33 angiosperm genomes. The regression slope is 0.43, with a p-value = 0.008, and R² = 0.20. Host plant tags are partially shown.
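As a reminder of what the BH correction mentioned in this legend does, the sketch below implements the standard Benjamini-Hochberg adjustment in plain Python. It is equivalent in intent to R's `p.adjust(p, method = "BH")`, but it is not the code used in the study, and the input p-values are illustrative rather than the unadjusted values behind the panels.

```python
from typing import List


def bh_adjust(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative raw p-values for four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(raw)
```

Each raw p-value is scaled by the number of tests divided by its rank, and a running minimum keeps the adjusted values non-decreasing with rank.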

https://doi.org/10.1371/journal.ppat.1014340.s004

(TIF)

S5 Fig. Taxonomic coverage of the genomic dataset across tracheophytes.

The cladogram illustrates the taxonomic representation of tracheophytes in the genomic dataset used in this study. Plant groups are shown at the family level for lycophytes, gymnosperms, and angiosperms, and the order level for ferns and angiosperms. Groups highlighted in red indicate taxa for which genomic sequences are included in the primary plant genome dataset (Fig 1), whereas groups in blue denote taxa not represented by genomic data.

https://doi.org/10.1371/journal.ppat.1014340.s005

(TIF)

S6 Fig. Phylogenetic placement of transcript-derived nt-RT-RH sequences from Botrypus virginianus and Lindsaea linearis within Caulimoviridae.

The phylogenetic tree was inferred from 142 nucleotide sequences corresponding to the RT-RH1 domain of Caulimoviridae, using IQ-TREE 2.1.2 under the GTR + F + I + G4 substitution model. The tree is rooted with a Metaviridae sequence. Only clade B is shown. Red circles indicate nodes with bootstrap support ≥ 70. Two sequences highlighted in blue originate from the 1KP transcriptomic dataset: TSA | B. virginianus corresponds to scaffold BEGM-2004510 from Botrypus virginianus, and TSA | L. linearis corresponds to scaffold NOKI-2097008 from Lindsaea linearis.

https://doi.org/10.1371/journal.ppat.1014340.s006

(TIF)

S7 Fig. Phylogenetic placement of Austrobaileya scandens nt-RT-RH transcripts within the Caulimoviridae.

The phylogenetic tree comprises 142 nucleotide sequences corresponding to the RT-RH1 domain of Caulimoviridae, inferred using IQ-TREE 2.1.2 under the GTR + F + I + G4 substitution model. The tree is rooted with a Metaviridae reference sequence. Only clades B.2.2, B.2. Florendovirus, and B.2. Gymnendovirus 3–4 are depicted. Bootstrap support values ≥ 70 are indicated by red circles. Two sequences derived from transcriptomic data in the 1KP project are highlighted in blue: scaffold-FZJL-2010292-Austrobaileya and scaffold-FZJL-2022942-Austrobaileya.

https://doi.org/10.1371/journal.ppat.1014340.s007

(TIF)

S1 Table. Characteristics of 73 embryophyte genomes.

This table presents comprehensive details for each of the 73 plant genomes included in the first genomic dataset: (i) Taxonomic information: group (column 1), order (column 2), family (column 3), species name (column 4); (ii) Tag name (column 5): shortened label derived from the species name, used for clarity in graphical representations; (iii) Genome metrics: size in Mbp (column 6); N50 value (assembly quality measure, in kbp), calculated using QUAST (Gurevich et al., 2013) (column 7); (iv) Reference information: Genome Link (column 8): source or database accession providing access to the genome data.

https://doi.org/10.1371/journal.ppat.1014340.s008

(XLSX)

S2 Table. List of OTUs within the Caulimoviridae.

This table describes the composition of the caulimovirid OTUs obtained using a 62% identity threshold for RT clustering. New OTUs are highlighted in blue and numbered from 1 to 35, while reference OTUs are shown in grey and named according to their included reference sequences. Outgroup sequences are indicated in red. Reverse transcriptase (RT) sequences are labeled either by their reference name or by a Caulifinder-assigned number followed by the host species tag. When multiple references originate from different studies, all are included in the RT sequence name.

https://doi.org/10.1371/journal.ppat.1014340.s009

(XLSX)

S3 Table. Results of the analysis of 66 plant genome sequences using Caulifinder branch A.

For each genome in the dataset of 66 tracheophyte species, the table provides: (i) Taxonomic details including division (column 1) and species name (column 2); (ii) Caulifinder branch B results with presence/absence of aa-ECRT (column 3); (iii) Caulifinder branch A results with the number of consensuses built (column 4), the number of consensuses left after filtering out false positives (see Materials and methods) (column 5), information on the size of the filtered consensuses (columns 6, 7, and 8), and the number of consensuses with an RT-RH1 domain (column 9).

https://doi.org/10.1371/journal.ppat.1014340.s010

(XLSX)

S4 Table. List of 143 RT-RH1 sequences used for the phylogenetic analysis of the Caulimoviridae.

The dataset comprises 143 sequences: 142 from the Caulimoviridae family and one from the Metaviridae Ty3 element, which serves as an outgroup for phylogenetic rooting. Each sequence label is unique. Reference sequences retain their established names. Non-reference sequences are identified by the host species name, followed by the type of sequence (“copy” or “cons” for consensus), and a unique identifier. New OTUs are numbered from 1 to 35. Reference OTUs are named according to the reference sequences they include. The text referring to genera or OTUs that lack fully characterized sequences (either copy or consensus) is in red.

https://doi.org/10.1371/journal.ppat.1014340.s011

(XLSX)

S5 Table. List of 7 angiosperm genomes and results of Caulifinder branch A.

For each genome in the dataset of 7 angiosperm species, the table provides: (i) Taxonomic details including species name (column 1) and group (column 2); (ii) Tag name (column 3) used in graphical representations for clarity; (iii) Genome details with link to genome sequence (column 4), genome size (column 5), and N50 (column 6); (iv) Caulifinder branch B results with presence/absence of aa-ECRT (column 7); (v) Caulifinder branch A results with the number of consensuses built (column 8), the number of consensuses left after filtering out false positives (see Materials and methods) (column 9), and information on the size of the filtered consensuses (columns 10–12).

https://doi.org/10.1371/journal.ppat.1014340.s012

(XLSX)

S6 Table. List of 52 RT sequences used to establish the Ortervirales phylogeny.

Sequences from families other than Caulimoviridae are highlighted in red. Reference OTUs are highlighted in grey and are labeled based on their corresponding reference sequences. Sequences associated with new operational taxonomic units (OTUs) are highlighted in blue.

https://doi.org/10.1371/journal.ppat.1014340.s013

(XLSX)

S7 Table. List of 14 mesangiosperm genome sequences used for nt-ECRT mapping.

Information is organized as follows: (i) Taxonomic information: species name (column 1), division (column 2), group (column 3); (ii) Tag name: a simplified label derived from the species name used in certain graphical representations (column 4); (iii) Genome metrics: genome size in Mbp (column 5) and N50 value in kbp calculated using QUAST (Gurevich et al., 2013) (column 6); (iv) Link or reference to genome data (column 7).

https://doi.org/10.1371/journal.ppat.1014340.s014

(XLSX)

S8 Table. ECRT counts in 93 embryophyte genomic sequences.

For each plant species, the following information is provided: (i) plant group classification; (ii) species name; (iii) abbreviated Tag name used in graphical representations; (iv) total number of detected amino acid-based ECRTs (aa-ECRTs); (v) total number of nucleotide-based ECRTs (nt-ECRTs); (vi) fold change between the nt-ECRT and aa-ECRT counts.

https://doi.org/10.1371/journal.ppat.1014340.s015

(XLSX)

S9 Table. OTU best hit counts across tracheophyte genomes.

Each column represents a plant genome, while each row corresponds to an OTU. The values indicate how many nt-ECRT sequences from a given plant genome were most similar to sequences within each OTU. This matrix enables visualization of OTU distribution and abundance across host species, supporting comparative analyses of Caulimoviridae diversity within the tracheophyte clade.
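The structure of such a matrix can be illustrated with a short sketch that tallies best hits into an OTU-by-genome table. The input format (one `(genome tag, OTU)` pair per nt-ECRT), the function name, and the example tags are assumptions made for illustration and do not reproduce the contents of the table.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def best_hit_matrix(best_hits: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Count nt-ECRT best hits per OTU (rows) and per genome (columns)."""
    counts = Counter(best_hits)
    genomes = sorted({genome for genome, _ in counts})
    otus = sorted({otu for _, otu in counts})
    # Fill every cell so that absent OTU/genome combinations appear as zero.
    return {
        otu: {genome: counts.get((genome, otu), 0) for genome in genomes}
        for otu in otus
    }


# Hypothetical best-hit assignments: (genome tag, OTU of the most similar sequence).
hits = [
    ("Wnob", "OTU 19"),
    ("Wnob", "OTU 19"),
    ("Wnob", "Gymnendovirus 3"),
    ("Atha", "Florendovirus"),
]
matrix = best_hit_matrix(hits)
```

Rows are OTUs and columns are genomes, mirroring the layout described above, with explicit zeros where an OTU was never the best hit in a given genome.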

https://doi.org/10.1371/journal.ppat.1014340.s016

(XLSX)

S10 Table. Results of BLASTx analysis of caulimovirid transcripts.

This table presents the outcomes of the BLASTx analysis conducted between caulimovirid transcripts and the 361 amino acid reverse transcriptase (aa-RT) sequences derived from OTU clustering. The table includes the following information for each dataset: (i) Reference information (column 1); (ii) Plant division represented in the transcriptomes (column 2); (iii) Total number of BLASTx hits (column 3); (iv) Number of hits with a length ≥ 150 amino acids (aa) (column 4); (v) Number of hits with a length ≥ 150 aa originating from transcripts associated with underrepresented or unstudied plant families (column 5).

https://doi.org/10.1371/journal.ppat.1014340.s017

(XLSX)

S11 Table. Plant species hosting episomal Caulimoviridae.

This table presents the plant species hosting episomal caulimovirids according to the ICTV (Teycheney et al. 2020): (i) Virus name (column 1); (ii) Number of virus (column 2); (iii) Acronym (column 3); (iv) Taxonomic information about the host plant: species name, order, family, class (columns 4–7); (v) Number of host plant (column 8); (vi) Reference (column 9).

https://doi.org/10.1371/journal.ppat.1014340.s018

(XLSX)

S12 Table. Plant species hosting Caulimoviridae.

This table presents the plant species hosting ECVs or caulimovirids with unknown status: (i) Taxonomic information about the host plant: species name, order, family, class (columns 1–4); (ii) Number of host plant (column 5); (iii) Reference (column 6).

https://doi.org/10.1371/journal.ppat.1014340.s019

(XLSX)

S13 Table. Plant species hosting ECVs.

This table presents the plant species hosting ECVs according to detection in genomic data: (i) Taxonomic information about the host plant: species name, order, family, class (columns 1–4); (ii) Number of host plant (column 5); (iii) Reference (column 6).

https://doi.org/10.1371/journal.ppat.1014340.s020

(XLSX)

S14 Table. Reference dataset of 98 Caulimoviridae aa-RT and 8 outgroup sequences.

(i) Sequence name (column 1); (ii) Reference information and/or link (columns 2 and 3).

https://doi.org/10.1371/journal.ppat.1014340.s021

(XLSX)

Acknowledgments

The authors thank Christophe Plomion (INRAE) for providing access to the Cupressus sempervirens genome sequence ahead of publication. The authors thank Anamarija Butkovic for providing access to plant viral 30K MP sequences and Thierry Candresse for critical reading of the manuscript.

References

  1. Lefeuvre P, Martin DP, Elena SF, Shepherd DN, Roumagnac P, Varsani A. Evolution and ecology of plant viruses. Nat Rev Microbiol. 2019;17(10):632–44. pmid:31312033
  2. Rubino L, Abrahamian P, An W, Aranda MA, Ascencio-Ibañez JT, Bejerman N, et al. Summary of taxonomy changes ratified by the International Committee on Taxonomy of Viruses from the Plant Viruses Subcommittee, 2025. J Gen Virol. 2025;106(7):002114. pmid:40711908
  3. Dolja VV, Krupovic M, Koonin EV. Deep roots and splendid boughs of the global plant virome. Annu Rev Phytopathol. 2020;58:23–53. pmid:32459570
  4. Aiewsakun P, Katzourakis A. Endogenous viruses: connecting recent and ancient viral evolution. Virology. 2015;479–480:26–37. pmid:25771486
  5. Vassilieff H, Geering ADW, Choisne N, Teycheney P-Y, Maumus F. Endogenous caulimovirids: fossils, zombies, and living in plant genomes. Biomolecules. 2023;13(7):1069. pmid:37509105
  6. Teycheney PY, Geering ADW, Dasgupta I, Hull R, Kreuze JF, Lockhart B, et al. ICTV virus taxonomy profile: caulimoviridae. J Gen Virol. 2020;101(10):1025–6. pmid:32940596
  7. Krupovic M, Blomberg J, Coffin JM, Dasgupta I, Fan H, Geering AD, et al. Ortervirales: new virus order unifying five families of reverse-transcribing viruses. J Virol. 2018;92(12):e00515-18. pmid:29618642
  8. Ndowora T, Dahal G, LaFleur D, Harper G, Hull R, Olszewski NE, et al. Evidence that badnavirus infection in Musa can originate from integrated pararetroviral sequences. Virology. 1999;255(2):214–20. pmid:10069946
  9. Harper G, Osuji JO, Heslop-Harrison JS, Hull R. Integration of banana streak badnavirus into the Musa genome: molecular and cytogenetic evidence. Virology. 1999;255(2):207–13. pmid:10069945
  10. Jakowitsch J, Mette MF, van Der Winden J, Matzke MA, Matzke AJ. Integrated pararetroviral sequences define a unique class of dispersed repetitive DNA in plants. Proc Natl Acad Sci U S A. 1999;96(23):13241–6. pmid:10557305
  11. Dallot S, Acuña P, Rivera C, Ramírez P, Côte F, Lockhart BE, et al. Evidence that the proliferation stage of micropropagation procedure is determinant in the expression of banana streak virus integrated into the genome of the FHIA 21 hybrid (Musa AAAB). Arch Virol. 2001;146(11):2179–90. pmid:11765919
  12. Richert-Pöggeler KR, Noreen F, Schwarzacher T, Harper G, Hohn T. Induction of infectious petunia vein clearing (pararetro) virus from endogenous provirus in petunia. EMBO J. 2003;22(18):4836–45. pmid:12970195
  13. Lockhart BE, Menke J, Dahal G, Olszewski NE. Characterization and genomic analysis of tobacco vein clearing virus, a plant pararetrovirus that is transmitted vertically and related to sequences integrated in the host genome. J Gen Virol. 2000;81(Pt 6):1579–85. pmid:10811941
  14. Geering ADW, Maumus F, Copetti D, Choisne N, Zwickl DJ, Zytnicki M, et al. Endogenous florendoviruses are major components of plant genomes and hallmarks of virus evolution. Nat Commun. 2014;5:5269. pmid:25381880
  15. Mushegian AR, Elena SF. Evolution of plant virus movement proteins from the 30K superfamily and of their homologs integrated in plant genomes. Virology. 2015;476:304–15. pmid:25576984
  16. Diop SI, Geering ADW, Alfama-Depauw F, Loaec M, Teycheney PY, Maumus F. Tracheophyte genomes keep track of the deep evolution of the Caulimoviridae. Sci Rep. 2018;8:572.
  17. Gong Z, Han G-Z. Euphyllophyte paleoviruses illuminate hidden diversity and macroevolutionary mode of Caulimoviridae. J Virol. 2018;92(10):e02043-17. pmid:29491164
  18. Debat H, Bejerman N. A glimpse into the DNA virome of the unique “living fossil” Welwitschia mirabilis. Gene. 2022;843:146806. pmid:35963497
  19. de Tomás C, Vicient CM. Genome-wide identification of Reverse Transcriptase domains of recently inserted endogenous plant pararetrovirus (Caulimoviridae). Front Plant Sci. 2022;13:1011565. pmid:36589050
  20. Vassilieff H, Haddad S, Jamilloux V, Choisne N, Sharma V, Giraud D, et al. CAULIFINDER: a pipeline for the automated detection and annotation of caulimovirid endogenous viral elements in plant genomes. Mob DNA. 2022;13(1):31. pmid:36463202
  21. Geering ADW, Scharaschkin T, Teycheney P-Y. The classification and nomenclature of endogenous viruses of the family Caulimoviridae. Arch Virol. 2010;155(1):123–31. pmid:19898772
  22. Lloréns C, Futami R, Bezemer D, Moya A. The Gypsy Database (GyDB) of mobile genetic elements. Nucleic Acids Res. 2008;36(Database issue):D38-46. pmid:17895280
  23. Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5. pmid:23422339
  24. Sayers EW, Beck J, Bolton EE, Brister JR, Chan J, Connor R, et al. Database resources of the National Center for Biotechnology Information in 2025. Nucleic Acids Res. 2025;53(D1):D20–9. pmid:39526373
  25. One Thousand Plant Transcriptomes Initiative. One thousand plant transcriptomes and the phylogenomics of green plants. Nature. 2019;574(7780):679–85. pmid:31645766
  26. Ali Z, Tan QW, Lim PK, Chen H, Pfeifer L, Julca I, et al. Comparative transcriptomics in ferns reveals key innovations and divergent evolution of the secondary cell walls. Nat Plants. 2025;11(5):1028–48. pmid:40269175
  27. Xia Z-Q, Wei Z-Y, Shen H, Shu J-P, Wang T, Gu Y-F, et al. Lycophyte transcriptomes reveal two whole-genome duplications in Lycopodiaceae: insights into the polyploidization of Phlegmariurus. Plant Divers. 2021;44(3):262–70. pmid:35769590
  28. Rastrojo A, Núñez A, Moreno DA, Alcamí A. A new putative Caulimoviridae genus discovered through air metagenomics. Microbiol Resour Announc. 2018;7(14):e00955-18. pmid:30533707
  29. Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics. 2003;19(12):1572–4. pmid:12912839
  30. Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics. 2001;17(8):754–5. Available from: https://academic.oup.com/bioinformatics/article/17/8/754/235132
  31. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S, et al. MrBayes 3.2: Efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol. 2012;61(3):539–42. pmid:22357727
  32. Butkovic A, Dolja VV, Koonin EV, Krupovic M. Plant virus movement proteins originated from jelly-roll capsid proteins. PLoS Biol. 2023;21(6):e3002157. pmid:37319262
  33. Wang J, Chitsaz F, Derbyshire MK, Gonzales NR, Gwadz M, Lu S, et al. The conserved domain database in 2023. Nucleic Acids Res. 2023;51(D1):D384–8. pmid:36477806
  34. Stevenson DW, Ramakrishnan S, de Santis Alves C, Coelho LA, Kramer M, Goodwin S, et al. The genome of the Wollemi pine, a critically endangered “living fossil” unchanged since the Cretaceous, reveals extensive ancient transposon activity. bioRxiv. 2023:2023.08.24.554647. pmid:37662366
  35. Marchioro CA, Santos KL, Siminski A. Present and future of the critically endangered Araucaria angustifolia due to climate change and habitat loss. Forestry. 2019;93(3):401–10.
  36. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B: Stat Methodol. 1995;57(1):289–300.
  37. Kumar S, Suleski M, Craig JM, Kasprowicz AE, Sanderford M, Li M, et al. TimeTree 5: an expanded resource for species divergence times. Mol Biol Evol. 2022;39(8):msac174. pmid:35932227
  38. Morris JL, Puttick MN, Clark JW, Edwards D, Kenrick P, Pressel S, et al. The timescale of early land plant evolution. Proc Natl Acad Sci U S A. 2018;115(10):E2274-83. pmid:29463716
  39. Lehtonen S, Silvestro D, Karger DN, Scotese C, Tuomisto H, Kessler M, et al. Environmentally driven extinction and opportunistic origination explain fern diversification patterns. Sci Rep. 2017;7:4831. pmid:28684788
  40. Rothwell GW, Scheckler SE, Gillespie WH. Elkinsia gen. nov., a Late Devonian gymnosperm with cupulate ovules. Bot Gaz. 1989;150(2):170–89.
  41. Stull GW, Qu X-J, Parins-Fukuchi C, Yang Y-Y, Yang J-B, Yang Z-Y, et al. Gene duplications and phylogenomic conflict underlie major pulses of phenotypic evolution in gymnosperms. Nat Plants. 2021;7(8):1015–25. pmid:34282286
  42. Herendeen PS, Friis EM, Pedersen KR, Crane PR. Palaeobotanical redux: revisiting the age of the angiosperms. Nat Plants. 2017;3:17015. pmid:28260783
  43. Nitta JH, Schuettpelz E, Ramírez-Barahona S, Iwasaki W. An open and continuously updated fern tree of life. Front Plant Sci. 2022;13:909768. pmid:36092417
  44. Alvarez-Quinto RA, Lockhart BEL, Fetzer JL, Olszewski NE. Genomic characterization of cycad leaf necrosis virus, the first badnavirus identified in a gymnosperm. Arch Virol. 2020;165(7):1671–3. pmid:32335770
  45. McLoughlin S. Gymnosperms. In: Encyclopedia of geology [Internet]. Elsevier; 2021 [cited 2026 Apr 24]. p. 476–500. Available from: https://linkinghub.elsevier.com/retrieve/pii/B9780081029084000680 https://doi.org/10.1016/B978-0-08-102908-4.00068-0
  46. Benton MJ, Wilf P, Sauquet H. The Angiosperm Terrestrial Revolution and the origins of modern biodiversity. New Phytol. 2022;233(5):2017–35. pmid:34699613
  47. Chatterjee S, Goswami A, Scotese CR. The longest voyage: tectonic, magmatic, and paleoclimatic evolution of the Indian plate during its northward flight from Gondwana to Asia. Gondwana Res. 2013;23(1):238–67.
  48. Reguero MA, Gelfo JN, López GM, Bond M, Abello A, Santillana SN, et al. Final Gondwana breakup: the Paleogene South American native ungulates and the demise of the South America–Antarctica land connection. Global Planet Change. 2014;123:400–13.
  49. French RK, Anderson SH, Cain KE, Greene TC, Minor M, Miskelly CM, et al. Host phylogeny shapes viral transmission networks in an island ecosystem. Nat Ecol Evol. 2023;7(11):1834–43. pmid:37679456
  50. Spencer V, Nemec Venza Z, Harrison CJ. What can lycophytes teach us about plant evolution and development? Modern perspectives on an ancient lineage. Evol Dev. 2021;23(3):174–96. pmid:32906211
  51. Falcon-Lang HJ, Dimichele WA. What happened to the coal forests during Pennsylvanian glacial phases? PALAIOS. 2010;25(9):611–7.
  52. Maumus F, Epert A, Nogué F, Blanc G. Plant genomes enclose footprints of past infections by giant virus relatives. Nat Commun. 2014;5:4268. pmid:24969138
  53. Silvestro D, Cascales-Miñana B, Bacon CD, Antonelli A. Revisiting the origin and diversification of vascular plants through a comprehensive Bayesian analysis of the fossil record. New Phytol. 2015;207(2):425–36. pmid:25619401
  54. Vassilieff H, Maumus F, Haddad S, Jamilloux V, Teycheney PY. Caulifinder banks [Internet]. Recherche Data Gouv; 2023 [cited 2026 Apr 24]. Available from: https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.57745/ADFNMB https://doi.org/10.57745/ADFNMB
  55. Serfraz S. Caulimoviridae evolution through paleovirological approaches. Antilles. 2021.
  56. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinform. 2009;10:421. pmid:20003500
  57. Van Dongen S. Graph clustering via a discrete uncoupling process. SIAM J Matrix Anal Appl. 2008;30(1):121–41.
  58. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30(4):772–80. pmid:23329690
  59. Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol. 2020;37(5):1530–4.
  60. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat Methods. 2017;14(6):587–9. pmid:28481363
  61. Rambaut A, Drummond AJ, Xie D, Baele G, Suchard MA. Posterior summarization in Bayesian phylogenetics using Tracer 1.7. Syst Biol. 2018;67(5):901–4. pmid:29718447
  62. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 2021;49(D1):D412–9. pmid:33125078
  63. Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28(3):405–20. pmid:9223186
  64. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. https://doi.org/10.1093/bioinformatics/btp348
  65. Zonneveld BJM. Genome sizes of all 19 Araucaria species are correlated with their geographical distribution. Plant Syst Evol. 2012;298(7):1249–55.
  66. Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9(9):868–77. pmid:10508846