Evolution of and Horizontal Gene Transfer in the Endornavirus Genus

The transfer of genetic information between unrelated species is referred to as horizontal gene transfer. Previous studies have demonstrated that both retroviral and non-retroviral sequences have been integrated into eukaryotic genomes. Recently, we identified many non-retroviral sequences in plant genomes. In this study, we investigated the evolutionary origin and gene transfer of domains present in endornaviruses, which are double-stranded RNA viruses. Using the available sequences for endornaviruses, we found that Bell pepper endornavirus-like sequences homologous to the glycosyltransferase 28 domain are present in plants, fungi, and bacteria. The phylogenetic analysis revealed that the glycosyltransferase 28 domain of Bell pepper endornavirus may have originated from bacteria. In addition, two domains of Oryza sativa endornavirus, a glycosyltransferase sugar-binding domain and a capsular polysaccharide synthesis protein, also exhibited high similarity to those of bacteria. We found evidence that at least four independent horizontal gene transfer events for the glycosyltransferase 28 domain have occurred among plants, fungi, and bacteria. The glycosyltransferase sugar-binding domains of two proteobacteria may have been horizontally transferred to the genome of Thalassiosira pseudonana. Our study is the first to show that three glycome-related viral genes in the genus Endornavirus have been acquired from marine bacteria by horizontal gene transfer.


Introduction
Eukaryotic genomes have acquired genetic information through two different mechanisms throughout the course of evolution. The first mechanism is vertical gene transfer, in which progeny receive genetic information from their ancestors, such as their parents. The second mechanism is horizontal gene transfer (HGT), which is the transfer of genetic information between unrelated species [1]. Evidence for HGT events has frequently been observed in prokaryotes and eukaryotes [2][3][4][5]. Numerous studies have suggested that HGT is one of the important keys to understanding the evolution of prokaryotic and eukaryotic genomes [1,3].
One frequent type of HGT event might be between a virus and its host [6]. Among the many known viruses, the retroviruses can easily integrate their viral genes or genomes into the host chromosomes because these RNA viruses utilize a reverse transcriptase to produce DNA from the RNA genome for their replication in a host cell [7,8]. Consequently, a large number of retroviral sequences have been found in eukaryotic genomes through sequencing and comparative analyses [9]. Endogenous hepadnaviruses have been discovered in the genomes of passerine birds, which include more than half of all bird species [10]. Previous studies have also identified endogenous pararetroviruses (EPRVs) in plant genomes [11][12][13]. EPRVs integrate into plant nuclear genomes and become part of the plant genomes as the result of evolutionary forces [14].
Recently, several studies have demonstrated that non-retroviral sequences can also be integrated into eukaryotic genomes [15][16][17]. For instance, non-retroviral elements homologous to sequences in Bornavirus, Filovirus, Circovirus, and Parvovirus have been discovered in the genomes of several mammalian species [15]. The integration of non-retroviral RNA virus sequences (NRVSs) has also been demonstrated for several plant genomes, and multiple integration events for non-retroviral sequences into different plant lineages have been identified [16].
The members of the genus Endornavirus are not retroviruses, and this genus was recently created as a new genus of double-stranded (ds) RNA viruses in the family Endornaviridae by the International Committee on Taxonomy of Viruses (ICTV) [18]. The genomes of endornaviruses are linear dsRNAs of 9.8-17.6 kb in length and have only one open reading frame (ORF) [19]. These ORFs normally encode a single polypeptide that is thought to be processed by a proteinase, and the genome contains conserved motifs, including an RNA-dependent RNA polymerase (RdRp) and viral RNA helicases (Hel) [20]. Endornaviruses seem not to form true virions and are usually present at a low copy number [21]. These viruses have been found in plants, fungi, and protists [22]. Recently, our group identified several viral sequences that are homologous to plant genes. The gene transfer of such endogenous viral sequences might have occurred from the virus to the host or from the host to the virus. In this study, we obtained strong evidence for gene transfer between the virus and the host using the endornaviruses as a model. Based on these results, we propose a hypothesis related to the evolutionary origins and horizontal gene transfer of endornaviral genes.

Sequence alignment and phylogenetic analysis
To align and visualize sequences, ClustalW implemented in MEGA 5 software was used. The most appropriate substitution model was selected for each alignment according to Akaike's information criterion (AIC) calculated using the ProtTest server (http://darwin.uvigo.es/software/prottest_server.html) [23]. For the phylogenetic analysis presented in Figure 1C, the CpREV+I+G model was selected as the best-fit substitution model. The LG+I+G model was selected for the phylogenetic analyses presented in Figure 2, Figure 3A, Figure 3B, and Supplementary figures S1, S3A, and S3B, each of which has a distinct gamma parameter and proportion of invariable sites. Phylogenetic trees were generated using the PhyML 3.0 server (http://www.atgc-montpellier.fr/phyml/) according to the best-fit models suggested by the ProtTest server. BIONJ was used as a starting tree, and subtree pruning and regrafting (SPR) was used for tree improvement [24]. The approximate likelihood ratio test (aLRT) values were calculated using the Shimodaira-Hasegawa-like (SH-like) procedure, and each branch is labeled with the result [25]. All obtained trees were edited using FigTree version 1.3.1 (http://tree.bio.ed.ac.uk/software/).
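
The model-selection step can be illustrated with a minimal sketch, assuming the usual definition AIC = 2k - 2 ln L, where k is the number of free parameters and ln L is the maximized log-likelihood. The log-likelihoods and parameter counts below are hypothetical values chosen for illustration; they are not output from ProtTest or results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike's information criterion: AIC = 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def best_model(candidates):
    """Return the name of the candidate model with the lowest AIC.

    `candidates` maps a model name to (log-likelihood, free-parameter count).
    """
    return min(candidates, key=lambda name: aic(*candidates[name]))


# Hypothetical (ln L, k) pairs, for illustration only.
example = {
    "LG": (-10250.0, 57),
    "LG+G": (-10120.0, 58),
    "LG+I+G": (-10110.0, 59),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in example.items()}
chosen = best_model(example)
print(scores, chosen)
```

With these made-up numbers the richest model wins because its gain in log-likelihood outweighs the penalty of two per extra parameter. With a smaller gain, the simpler model would be preferred.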

Detection of HGT
The 16S rRNA sequences of various species were retrieved from the SILVA rRNA database (http://www.arb-silva.de/). Phylogenetic trees based on the rRNA sequences of diverse species were generated using MEGA 5 software with the neighbor-joining method and bootstrap support of 1000 replicates after alignment using the ClustalW method. Protein trees were rerooted as species trees using FigTree version 1.3.1. The generated species and protein trees were converted into the Newick format using MEGA 5 and FigTree version 1.3.1, respectively. The detection of HGT was performed using the T-REX (Tree and Reticulogram Reconstruction) web server (http://www.trex.uqam.ca) [26].
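
The idea behind comparing a species tree with a protein tree can be sketched as follows. This is not the procedure used by the T-REX server, which applies its own, more elaborate HGT-detection algorithm. The sketch only measures topological discordance between two Newick trees with the Robinson-Foulds distance, the number of bipartitions found in one tree but not the other. The taxon names and both trees are hypothetical.

```python
def leaf_sets(newick):
    """Parse a Newick string; return (all leaves, leaf sets of internal nodes)."""
    s = newick.strip().rstrip(";")
    pos = 0
    clades = []

    def skip_label():
        # Skip a node label, support value and/or ":branch_length".
        nonlocal pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1

    def parse():
        nonlocal pos
        leaves = set()
        if s[pos] == "(":
            pos += 1  # consume "("
            while True:
                leaves |= parse()
                if s[pos] == ",":
                    pos += 1  # another child follows
                    continue
                pos += 1  # consume ")"
                break
            skip_label()
            clades.append(frozenset(leaves))
        else:
            start = pos
            skip_label()
            leaves.add(s[start:pos].split(":")[0].strip())
        return leaves

    return frozenset(parse()), clades


def splits(newick):
    """Return (leaf set, non-trivial bipartitions) of the unrooted tree."""
    leaves, clades = leaf_sets(newick)
    ref = min(leaves)  # each split is stored as the side that lacks `ref`
    result = set()
    for clade in clades:
        side = leaves - clade if ref in clade else clade
        if 1 < len(side) < len(leaves) - 1:
            result.add(side)
    return leaves, result


def robinson_foulds(tree_a, tree_b):
    """Count the bipartitions present in exactly one of the two trees."""
    leaves_a, splits_a = splits(tree_a)
    leaves_b, splits_b = splits(tree_b)
    if leaves_a != leaves_b:
        raise ValueError("trees must share the same leaf set")
    return len(splits_a ^ splits_b)


# Hypothetical trees: in the protein tree, fungusA groups with plantA.
species_tree = "((plantA,plantB),(fungusA,(bactA,bactB)));"
protein_tree = "((plantA,fungusA),(plantB,(bactA,bactB)));"
print(robinson_foulds(species_tree, protein_tree))
```

A distance of zero means the two topologies agree. A non-zero distance flags discordance, which HGT can cause, although gene loss, paralogy, or reconstruction error can produce the same signal.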

Identification of endornavirus-like sequences in various plant genomes
BLAST searches identified several endornavirus-like sequences in plant genomes. Among sequences from known endornaviruses, only partial sequences for Bell pepper endornavirus (BPEV) [27] and Oryza sativa endornavirus (OsEV) [28] have been shown to be homologous to specific regions of plant proteins. A total of 30 nonredundant endogenous BPEV-like sequences, referred to as EBPEs, were identified in 19 plant species (Table 1). Some plant species harbor multiple EBPEs. For example, Populus trichocarpa, Glycine max, and Cucumis sativus each possess two EBPEs, whereas Citrus sinensis and Physcomitrella patens harbor three and four EBPEs, respectively (Table 1). Of known algae species, only T. pseudonana contains an EBPE, and other monocot plants, such as sorghum and rice, carry several EBPEs. Interestingly, all identified EBPEs are homologous to one specific domain of BPEV, which is referred to as the glycosyltransferase 28 (GT28) domain (Figure 1A). The murG gene of Escherichia coli, which contains the GT28 domain, functions in the membrane steps of peptidoglycan synthesis [29]. The lengths of the identified EBPEs are variable, ranging from 61 to 467 amino acids (aa). The alignment of the amino acid sequences of the identified EBPEs and the GT28 domain of BPEV revealed high levels of sequence identity (Figure 1B). The phylogenetic tree contains two distinct clades (Figure 1C). The first clade includes only TpEBPE and BPEV, whereas the second clade contains most EBPEs, which could be divided into two sister groups (Figure 1C).
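
Percent identity between two aligned sequences can be computed as in the minimal sketch below. The text does not state which convention was used, so the choice here is an assumption: only columns in which neither sequence has a gap are counted. The two short sequences are hypothetical and are not EBPE or BPEV sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical aligned fragments, for illustration only.
print(percent_identity("MKT-LV", "MRTALV"))  # 4 matches over 5 compared columns
```

Other conventions, such as dividing by the full alignment length or by the length of the shorter sequence, give lower values for the same alignment, so the denominator should always be reported alongside the percentage.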

Identification of EBPEs from various databases
With the development of several high-throughput sequencing technologies, a large amount of sequencing data from many plant species is being produced [30]. We also identified 29 EBPEs from expressed sequence tag (EST) (18 sequences) and transcriptome shotgun assembly (TSA) (9 sequences) databases (Table 2). Of these 29 EBPEs, one was identified among the ESTs of the pepper plant (Capsicum annuum), which is a host plant for BPEV. In a search for EBPEs in various databases, we found that the GT28 domain exists in other organisms, including bacteria and fungi (Table 3). For example, 21 EBPEs were derived from 18 fungal species including Cryptococcus neoformans, Coccidioides immitis, and Coccidioides posadasii, and these fungi each possess a GT28 domain with 30% to 43% identity to the GT28 domain of BPEV (Table 3). In addition, many EBPEs are present in diverse bacteria, including Verrucomicrobiae bacterium, Burkholderia ambifaria, and Burkholderia ubonensis. Rather than identifying EBPEs using BLAST searches, we next identified sequences containing the GT28 domain (PF03033) in the Pfam database (http://pfam.sanger.ac.uk/) [31]. We found 2,898 sequences containing the GT28 domain in 2,051 species. Most of these sequences are derived from various bacteria (2,636 sequences from 1,963 species). Only 251 sequences are derived from the Eukaryota, covering 89 species, and 7 sequences are from the Archaea, covering 4 species belonging to the family Methanosarcinaceae (Table 3).

[Figure 1 caption, partially recovered: ... plant proteins using ClustalW. (C) Phylogenetic tree based on the glycosyltransferase domains of BPEV and 30 identified plant proteins constructed using the PhyML 3.0 server. The numbers on the branches are the aLRT values calculated using a SH-like method. Numbers greater than 0.5 are shown on each branch. doi:10.1371/journal.pone.0064270.g001]

Phylogenetic relationships and domain structures for plant GT28-containing proteins
BLAST searches using the BPEV sequences could not detect all GT28-containing proteins in plant genomes. Using the Phytozome ver. 7 database (http://www.phytozome.net) [32], we found 78 proteins containing the GT28 domain from 23 plant species. To elucidate the phylogenetic relationships of GT28-containing proteins, we constructed a phylogenetic tree (Figure S1A). The phylogenetic tree shows that most GT28-containing plant proteins, except CsEBPE1, are closely related. The domain structures of GT28-containing proteins show the localization of the GT28 domain in each protein (Figure S1B). Most GT28 domains are located in the 5′ region of each protein, but three EBPEs from Physcomitrella patens have the GT28 domain in the middle of the protein (Supplementary figure S1B). Interestingly, most GT28-containing proteins include an additional domain referred to as a UDP glycosyltransferase (UGT) (PF0001) in the C-terminal region. Algae including Chlamydomonas reinhardtii and Volvox carteri encode proteins containing only a GT28 domain and not a UGT domain, whereas mosses including Physcomitrella patens and Selaginella moellendorffii have several proteins that contain both the GT28 and UGT domains. In addition, the numbers of exons and introns in GT28-containing genes in plants are diverse. For example, two Arabidopsis thaliana genes containing the GT28 domain consist of 15 exons and 14 introns, whereas two Arabidopsis lyrata genes containing the GT28 domain comprise 14 exons and 13 introns.

Phylogenetic relationships of all identified EBPEs
To reveal the phylogenetic relationships of EBPEs and the origin of the GT28 domain in BPEV, we constructed a phylogenetic tree using all identified EBPEs from plants, bacteria, and fungi. The phylogenetic tree is divided into two major clades (Figure 2). The first clade encompasses EBPEs from plants, fungi, and bacteria, whereas the second clade comprises only EBPEs from fungi. In the first clade, the plant EBPEs appear to be generally monophyletic except for those of two diatoms, T. pseudonana and Fragilariopsis cylindrus. Interestingly, the clade containing these two diatoms includes three bacterial species, Cyanothece sp. PCC 7822, Bacillus megaterium, and Maribacter sp. HTCC2170. Surprisingly, the GT28 domain of BPEV is closely related to those of two bacteria, Verrucomicrobiae bacterium and Frankia sp. EAN1pec. In addition, Geomyces pannorum, which is a saprophytic fungus, is grouped together with higher plants. These results suggest that the gene transfer of the GT28 domain might have occurred among diatoms, bacteria, fungi, and endornaviruses.

Conserved domains present in nine endornaviruses
Based on the above result, we hypothesized that other domains present in endornaviruses might have originated from other organisms. We therefore examined the conserved domains of nine endornaviruses for which the complete protein sequences are currently available (Supplementary figure S2). The ORF lengths of these endornaviruses range from 3,217 aa to 5,825 aa. The Vicia faba endornavirus (VfEV) is the largest endornavirus, but it has only two conserved domains, a viral helicase domain and an RdRp. Tuber aestivum endornavirus (TaEV) is the smallest endornavirus and contains a DEAD-like helicase domain and an RdRp. Although all nine endornaviruses contain an RdRp domain, the compositions of the other domains are highly variable. One of the common domains in the nine endornaviruses is the UGT domain, which is present in six endornaviruses that infect plants, fungi, or protists (Supplementary figure S2). No sequences in plants highly similar to the UGT sequence were identified. In addition, there is no available information for UGT in the Pfam database.

Phylogenetic analysis of two distinct glycome-related domains in OsEV
In addition to UGT, OsEV contains two glycome-related domains, the glycosyltransferase sugar-binding region containing DXD motif (GTS) (PF04488) and the capsular polysaccharide synthesis protein (CPSP) (PF05704) (Supplementary figure S2). GTS is a GT, and the DXD motif of GTS is required for carbohydrate binding in sugar-nucleoside diphosphate- and manganese-dependent glycosyltransferases [33]. According to the Pfam database, there are at least 508 species, including 175 Eukaryota, 326 bacteria, 5 viruses, and 1 Archaea, that encode the GTS domain. To identify the phylogenetic relationships of GTSs from various species, we performed a BLAST search and constructed a phylogenetic tree, which contains two distinct clades (Figure 3A). The first clade includes only GTSs from diverse bacteria and OsEV. The second clade is composed primarily of GTSs from fungi, along with one diatom (T. pseudonana) and two bacteria (Rhodopirellula baltica and Micrococcus luteus). The GTS of T. pseudonana is more closely related to that of M. luteus. Next, we searched the Pfam database and identified 235 sequences from 171 species containing CPSP; these species included 163 bacteria, 25 Eukaryota, and one virus. Using the CPSP sequences highly homologous to that of OsEV, we constructed a phylogenetic tree, which had two distinct clades (Figure 3B). The first clade contained sequences from OsEV and bacteria. The second clade included T. pseudonana, three fungi (Neosartorya fumigata, Botryotinia fuckeliana, and Nectria haematococca), and two bacteria (Thalassiobium sp. and Maricaulis maris).

Prediction of horizontal gene transfer for each domain in endornaviruses
The phylogenetic analyses suggested that at least BPEV and OsEV acquired several domains via HGT. It is likely that HGT of glycome-related domains has occurred among different organisms. To assess this possibility, we compared species trees and domain trees for given sets of species as described in the Materials and Methods. We excluded endornaviruses from the analyses, as these are not assigned to the tree of life. In the case of the GT28 domain, at least four independent HGT events were detected among plants, fungi, and bacteria (Figure 4A). The GT28 domain in plants might have been transferred to Geomyces pannorum, T. pseudonana, and Vibrio coralliilyticus. Bacillus megaterium appears to have obtained the GT28 domain from T. pseudonana. The GTSs of two proteobacteria, M. maris and Thalassiobium sp., might have been horizontally transferred to the genome of T. pseudonana (Figure 4B).

Discussion
In the current study, we conducted phylogenetic analyses to explore the evolutionary origins of protein domains present in endornaviruses. Due to the limited number of available sequences for endornaviruses, only a few domains for endornaviruses were further analyzed. Our analyses allowed us to (i) identify endornavirus-like sequences in plants, fungi, and bacteria, (ii) reveal the phylogenetic relationships among these sequences, and (iii) elucidate the evolutionary origins of endornaviral genes by HGT.
Initially, all available endornavirus sequences were used in BLAST searches to identify endornavirus-like sequences in plant proteomes. Only partial sequences for BPEV and OsEV were matched to various plant proteomes, indicating that gene transfer might have occurred between endornaviruses and plant hosts. An extensive BLAST search and domain information from the Pfam database revealed that three domains, the GT28, GTS, and CPSP domains, are ubiquitous; these domains are present in Eukaryota, bacteria, Archaea, and viruses. These results suggest that some endornaviral genes might have been obtained from the host or [...] pseudonana are present together with BPEV in the same clade, and all three live in marine and freshwater environments. This result suggests two possible scenarios for how endornaviruses acquired the GT28 domain from their hosts or other organisms. The first scenario is direct horizontal gene transfer from marine bacteria to ancient endornaviruses that infected marine algae such as diatoms, via unknown events that caused genetic recombination. The second scenario is that marine diatoms first obtained the GT28 domain from marine bacteria that infect the diatoms, and then the ancient endornaviruses obtained the GT28 domain from the marine diatom host. T. pseudonana is a marine diatom that acquired plastids through secondary endosymbiosis [34]; a previous study found that T. pseudonana has acquired foreign genes such as membrane transporter genes via endosymbiotic/horizontal gene transfer (E/HGT) to adapt to marine environments [35]. Moreover, a recent study suggested that T. pseudonana is likely ancestrally a freshwater organism [36]. Therefore, we tentatively support the second scenario because the HGT of the GT28 gene could have occurred between diatoms and bacteria due to their presence in marine and freshwater environments and because phylogenetic evidence revealed that the sequences for T. pseudonana and BPEV were in the same clade.
To date, endornaviruses have been identified only in Eukaryota, including plants, fungi, and Chromista [22]. Based on our analysis, we propose the existence of endornaviruses that infect marine algae. The ancient endornaviruses that infected marine algae might have co-evolved with their hosts, and they might have begun infecting land plants during the evolution of higher plants. Thus, as yet unidentified endornaviruses that infect marine algae might have domain structures that are very similar to those of endornaviruses that infect higher plants. It is known that endornaviruses are only vertically transmitted through seeds [22], which could support the co-evolution of endornaviruses with their hosts. The Arabidopsis genome contains two genes (UGT80A2 and UGT80B1) that possess GT28 and UGT domains; these genes encode UDP-glucose:sterol glycosyltransferase enzymes (EC 2.4.1.173) [37]. These enzymes are involved in the synthesis of steryl glycosides (SGs) [38]. The UGT80A2 mutant showed mild defects in plant growth, whereas the UGT80B1 mutant exhibited severe phenotypes at both the embryo and seed stages. UGT80B1 is required for the deposition of flavonoids, the suberization of the seed, and the trafficking of lipid polyesters in membranes [37].
Genes involved in SG synthesis are ubiquitous in plants [39] and various fungi [40]. The null mutant of this gene in Saccharomyces cerevisiae exhibited normal growth under diverse culture conditions despite the reduced ability to synthesize sterol glucoside [40]. In bacteria, the murG gene from Escherichia coli contains two duplicated GT28 domains localized at the N-terminal (PF03033) and C-terminal (PF04101) regions, respectively. The mutation of murG led to an altered cell shape and a lytic thermosensitive phenotype [29]. CPSP is known to be a major virulence factor in Streptococcus pneumoniae and plays an important role in the production of a mature capsule in vitro [41]. Thus, the functions of genes related to glycosyltransferases are diverse and vary depending on the organism.
It is known that some DNA viruses contain several GT domains, but hypoviruses were the first RNA viruses known to encode GTs, designated UGTs [42,43]. It is also likely that the UGTs of hypoviruses originated from host genes. Endornaviruses have at least three different domains that are highly associated with glycome modification [22]. Interestingly, all three are closely related to those of marine bacteria. The GTs encoded by DNA viruses have been well characterized. Bacteriophages, phycodnaviruses, baculoviruses, poxviruses, and herpesviruses contain genes encoding GTs [44]. These DNA viruses appear to have co-evolved with their hosts, and they acquired the GTs for replication [44]. The glycome plays an important role in many biological processes. For instance, the viral GTs of DNA viruses are involved in many different mechanisms, such as the recognition of host cells and the regulation of virus-host interactions, which are regulated by the expression of host GTs or their own viral GTs [45,46]. Moreover, the GTs of some DNA viruses play a role in disrupting host defense mechanisms by inhibiting the activities of host restriction enzymes [44]. However, nothing is known about the functional roles of GTs in endornaviruses. It will be of interest to elucidate the functions of GTs in RNA viruses in the future. By analogy with the DNA viruses described in the previous study, GTs in endornaviruses are presumably beneficial to the virus [44]. As suggested in a previous study, GTs in endornaviruses might protect the viral RNA from degradation by modifying the RNA [22].
Several previous studies have also suggested that many viral genes might have originated from prokaryotic or eukaryotic genes [47]. For example, the heat shock protein 70 in the family Closteroviridae, the AlkB protein in the family Flexiviridae, and the Maf/HAM1-like pyrophosphatase in the family Potyviridae originated from host genes via horizontal gene transfer [48][49][50]. In general, these are ubiquitous genes that are present in prokaryotes and eukaryotes and play important roles in viral life cycles [47].
All endornaviruses contain a well-conserved RdRp, and some endornaviruses contain methyltransferase and helicase domains like most RNA viruses do. As shown in our study and in a previous study, the RdRp and viral methyltransferases of endornaviruses are similar to those of ssRNA viruses [22]. These data suggest that endornaviruses might have originated from ssRNA viruses or that the important domains in endornaviruses might have been obtained from ssRNA viruses via HGT [20]. In addition, a phylogenetic analysis found that the DEXDc domains present in GaBRV-XL and TaEV are closely related to those of bacteria and that of the fungus known as Sclerotinia sclerotiorum. Interestingly, a hypovirulent double-stranded (ds) RNA virus has been previously identified in the plant pathogen Sclerotinia sclerotiorum [51]. Therefore, we tentatively hypothesize that gene transfer occurred between the fungal host and dsRNA mycoviruses. Recently, several studies have confirmed that horizontal gene transfer has occurred between mycoviruses and the host [52].
In summary, we provide strong evidence for the HGT of domains present in endornaviruses, and we propose hypotheses regarding their possible origin and evolutionary scenario using phylogenetic data. Although several recent studies provide evidence for HGT, gene transfer between the virus and the host is still poorly understood. To elucidate the origin and the evolutionary processes of viral genes, rigorous systematic studies, including comparative sequence analyses and experimental studies, should be conducted.