No Evidence for Natural Selection on Endogenous Borna-Like Nucleoprotein Elements after the Divergence of Old World and New World Monkeys

Endogenous Borna-like nucleoprotein (EBLNs) elements were recently discovered as non-retroviral RNA virus elements derived from bornavirus in the genomes of various animals. Most of EBLNs appeared to be defective, but some of primate EBLN-1 to -4, which appeared to be originated from four independent integrations of bornavirus nucleoprotein (N) gene, have retained an open reading frame (ORF) for more than 40 million years. It was therefore possible that primate EBLNs have encoded functional proteins during evolution. To examine this possibility, natural selection operating on all ORFs of primate EBLN-1 to -4 was examined by comparing the rates of synonymous and nonsynonymous substitutions. The expected number of premature termination codons in EBLN-1 generated after the divergence of Old World and New World monkeys under the selective neutrality was also examined by the Monte Carlo simulation. As a result, natural selection was not identified for the entire region as well as parts of ORFs in the pairwise analysis of primate EBLN-1 to -4 and for any branch of the phylogenetic trees for EBLN-1 to -4 after the divergence of Old World and New World monkeys. Computer simulation also indicated that the absence of premature termination codon in the present-day EBLN-1 does not necessarily support the maintenance of function after the divergence of Old World and New World monkeys. These results suggest that EBLNs have not generally encoded functional proteins after the divergence of Old World and New World monkeys.


Introduction
Endogenous Borna-like nucleoprotein (EBLN) elements were recently discovered as non-retroviral RNA virus elements in the genomes of various animals, including primates, rodents, chiropterans, afrotherians, and fishes [1,2,3]. EBLNs appeared to have been derived from genomic integrations of reverse-transcribed mRNAs for the nucleoprotein (N) gene of bornavirus, which is a non-segmented, single-stranded (negative sense) RNA virus [4,5]. In primates, four copies of EBLNs (EBLN-1 to -4) were identified in the genomes of Old World and New World monkeys [1,2,3]. Each copy of EBLN-1 to -4 apparently started with the transcription start site of bornavirus N gene and ended with poly-A, and was flanked by the target site duplication, which was specific in length and sequence to each copy. These observations indicated that primate EBLN-1 to -4 were originated from four independent integrations of reverse-transcribed mRNA for bornavirus N gene by LINE before the divergence of Old World and New World monkeys (,44.2 million years ago (MYA) in TIMETREE [6]) [1,2].
Most of EBLNs appeared to be defective due to the existence of premature termination codons and frameshifts [2]. However, EBLN-1 of human (367 codons), chimpanzee (368 codons), and gorilla (368 codons) encoded an open reading frame (ORF), which was almost equivalent in length to the N gene of bornavirus (371-374 codons), indicating that the ORF has been maintained for more than 40 million years along the evolutionary lineage leading to these organisms. This phenomenon was considered unlikely to be observed in the absence of purifying selection on EBLN-1 [1,3]. In addition, human EBLN-1 to -4 were identified to be expressed in various tissues, and human EBLN-2 was found to interact with other proteins [1,2,7]. These observations raised a possibility that primate EBLNs have been functional in the host.
If primate EBLNs have encoded functional proteins, natural selection, either positive or negative, should have operated on them during evolution of primates. Natural selection operating at the amino acid sequence level may be detected by comparing the rate of nonsynonymous substitution (d N ) with that of synonymous substitution (d S ), where the relationships d N .d S , d N ,d S , and d N = d S indicate positive, negative, and no selection, respectively [8]. The purpose of the present study was to examine the functionality of EBLNs during primate evolution by identifying natural selection from the comparison of d N and d S .

ORFs in primate EBLN-1 to -4
Single orthologous regions of EBLN-1 to -4 were identified in the genomes of all primates analyzed in the present study (human, chimpanzee, gorilla, orangutan, macaque, and marmoset), except for the orthologous region of EBLN-4 in the marmoset genome, which has duplicated multiple times (Figures S1 and S2, Table S1) [3]. Macaque EBLN-1 and marmoset EBLN-4 contained sequences that were derived from Alu repeat elements (Figure 1), which were characterized as AluMacYa3, belonging to the macaque-specific Alu subfamily [9], and as AluSp or Alu2, respectively.
In order to investigate the functionality of primate EBLNs during evolution, a total of 100 ORFs with .50 codons was identified in the orthologous regions of EBLN-1 to -4 in the primate genomes ( Figure S1, Table S1). Most of these ORFs were not shared among the primates examined, because they were truncated due to premature termination codons in some species ( Figure S1). However, ORF1s of EBLN-1 in human (367 codons), chimpanzee (368 codons), and gorilla (368 codons) were almost equivalent in length to bornavirus N gene (371-374 codons). In addition, the size of ORF1a of EBLN-2 in gorilla and macaque was equal to that in human (273 codons), which is known to interact with other proteins [7]. ORF1a of macaque EBLN-1 contained the sequence derived from AluMacYa3 [9]. Parts of ORFs in marmoset EBLN-4 were also derived from transposable elements (AluSp or Alu2 for ORF2 of contig1733, ORF-1 of contig1733 and 1917, and ORF-2a of contig1129 and 6507, and Alu2 and LIME1 for ORF-2 of contig1129, 1952, 5225, and 6507).
Natural selection operating on ORFs in primate EBLN-1 to -4 In the above analysis, ORF1 of EBLN-1 and ORF1a of EBLN-2 were found to be relatively long in some primates. However, when the entire region of ORF1 of EBLN-1 was compared between primates, the d N /d S ratio ranged from 0.65 to 2.68, and no signature of natural selection was identified (Z-test: p.0.06) (Table 1). Similarly, no selection was detected for ORF1a of EBLN-2 (Z-test: p.0.1) as well as all other ORFs of EBLN-1 to -4 after the correction for multiple testing (Z-test: p.0.03) (Tables 1  and S2). In the window analysis, no common pattern was observed for the fluctuation of d N /d S ratio along linear sequences of ORFs in EBLN-1 to -4 between pairs of primates (Figures 1 and S3). Natural selection was not detected at any window between any pair of primates (Z-test: p.0.35).
When natural selection was examined at each branch of the phylogenetic trees for ORFs of EBLN-1 to -4, negative selection was detected at the basal branch of the phylogenetic tree for ORF1 of EBLN-1 and ORF1a of EBLN-2 after the correction for multiple testing (branch a in Figure 2; d N /d S ratio = 0.25, likelihood ratio test (LRT): p = 4.52610 26 ) (Table S3). It should be noted that this branch reflects the evolution of not only EBLN-1 and -2 but also bornavirus N gene before integration, because two independent integrations of bornavirus N gene appears to have taken place to give rise to EBLN-1 and -2 on this branch, as discussed above. No selection was detected at any other branches of the phylogenetic trees reflecting the evolution of EBLN-1 to -4 after the correction for multiple testing (LRT: p.1.34610 23 ) (Figures 2 and S4, Table S3).
Probability of maintaining ORF1 of EBLN-1 after the divergence of Old World and New World monkeys under the selective neutrality Two studies have argued for the possibility that purifying selection has operated on ORF1 of EBLN-1 after the divergence of Old World and New World monkeys [1,3]. Both arguments were based on the observation that the premature termination codon was absent in ORF1s of present-day EBLN-1 for human, chimpanzee, and gorilla, which was considered to be unexpected under the selective neutrality [1,3].
In the first study [1], the generation rate of termination codon was computed theoretically based on the assumption that the rate of nucleotide substitution was 1.2610 29 per site per year under the selective neutrality. It was claimed that there is a 12% probability that a random codon change will produce a termination codon in one mutational step when the rates of all nucleotide substitutions are the same. The generation rate of termination codon was estimated to be 1 per 2,310 codons per million years [1]. Therefore, if we assume that the divergence time of Old World and New World monkeys was 44.2 MYA or 54.1 MYA and the ancestral EBLN-1 consisted of 371 codons, 7.08 or 8.67 premature termination codons were expected to be observed in the present-day EBLN-1, respectively, which were much greater than zero [1].
In the second study [3], the Monte Carlo simulation was conducted for the evolution of EBLN-1 after the divergence of Old World and New World monkeys under the selective neutrality. It was assumed that the divergence time of Old World and New World monkeys was 54.1 MYA and the rate of nucleotide substitution under the selective neutrality was 2.2610 29 per site per year. When the consensus sequence of EBLN-1 was evolved according to this evolutionary scenario, the number of premature termination codons in the present-day EBLN-1 was 15.57 on average and the probability of observing zero premature termination codon was p,0.00001 [3].
However, there appeared some problems in these studies [1,3]. First, it is well-known that the proportion of nucleotide substitution producing termination codons under the assumptions of equal rates for all nucleotide substitutions and equal frequencies for all codons is 4% [10], which is one-third of the value (12%) assumed in [1]. Therefore, the expected number of premature termination codons in the present-day EBLN-1 may not be 7.08 or 8.67 but 2.36 or 2.89, which is not very much different from zero. In addition, in [3], the expected number of nucleotide substitutions occurring in 371 codons (1,113 nt) during 54.1 million years under the rate of 2.2610 29 per site per year is 132.11. If we assume that the probability for a nucleotide substitution to produce a termination codon is 4%, the expected number of premature termination codons in the present-day EBLN-1 is 5.3 (132.1160.04), which is much smaller than 15.57 predicted by the simulation [3].
Second, the divergence time of Old World and New World monkeys (54.1 MYA) and the rate of nucleotide substitution in primates (2.2610 29 per site per year) assumed in [3] may not be appropriate. In fact, the divergence time of Old World and New World monkeys has been estimated to be 44.2 MYA in TIMETREE [6]. In addition, the evolutionary rate of 2.2610 29 per site per year appears to have been reported as the average rate for mammals [11], which is higher than that estimated from the analysis of primate genomes evolving under the selective neutrality (0.99-1.5610 29 per site per year) [12].
For these reasons, we performed the Monte Carlo simulation for the evolution of EBLN-1 after the divergence of Old World and New World monkeys, similarly to that conducted in [3], under various conditions. When the consensus sequence of EBLN-1 (kindly provided by Dr. Robert J. Gifford) was evolved for 54.1 million years with the rate of 2.2610 29 per site per year, which mimics the simulation in [3], the average number of premature termination codons was 3.24 ( Figure 3F), which was significantly smaller than that (15.57) obtained in [3]. In addition, the probability of observing zero premature termination codon was 0.04, which was significantly greater than that (p,0.00001) obtained in [3]. Although the reason for this discrepancy was unclear, it may be noted that a somewhat similar result to [3] (average number of premature termination codons = 13.54 and p,0.00001) was observed when the divergence time or the rate was assumed to be ten times greater than that used above (541 MYA instead of 54.1 MYA or 22610 29 per site per year instead of 2.2610 29 per site per year) ( Figure S5).
In the above simulation, the probability of observing zero premature termination codon was smaller than 0.05 (p = 0.04)  (Figure 3A-3E). Similar results were obtained when the inferred ancestral sequence and the present-day bornavirus N sequence were evolved in the simulation ( Figures S6 and S7). These observations indicate that the absence of premature termination codon in the presentday EBLN-1 does not necessarily support the maintenance of function after the divergence of Old World and New World monkeys.

Discussion
Natural selection was not detected for the entire region or parts of the ORFs of EBLN-1 to -4 between any pair of primates and for any branch of the phylogenetic trees for EBLN-1 to -4 after the divergence of Old World and New World monkey, suggesting that primate EBLN-1 to -4 have not encoded functional proteins during this period. These results conflicted with those in the previous studies, which suggested that purifying selection has operated on EBLN-1 to maintain ORF1 during this period [1,3]. However, there appeared some problems in the previous studies, as discussed above. When the divergence time of Old World and New World monkeys and the neutral substitution rate assumed in the previous study [3] were changed to more realistic values, the probability of observing zero premature termination codon was found to be p.0.06 in the Monte Carlo simulation. Therefore, the absence of premature termination codon in ORF1s of the presentday EBLN-1 does not necessarily support the maintenance of function after the divergence of Old World and New World monkeys. The truncation of ORF1 due to premature termination codons in orangutan EBLN-1 and the insertion of an Alu element in macaque EBLN-1 also support the absence of functional constraint on EBLN-1.
It should be noted, however, that negative selection was detected at the basal branch of the phylogenetic tree for ORF1 of EBLN-1 and ORF1a of EBLN-2. This branch appeared to represent the evolution of bornavirus N gene before integration as well as EBLN-1 and -2 before the divergence of Old World and New World monkeys. Since the d N /d S ratio for bornavirus N gene is generally smaller than unity (data not shown), negative selection identified for the basal branch of EBLN-1 and -2 likely reflects the functional constraint on bornavirus N gene before integration. However, it might be possible that negative selection has also operated on EBLN-1 and -2 before the divergence of Old World and New World monkeys. This is because expression of bornavirus N protein is known to inhibit replication of bornavirus in rats and mice [13,14], and therefore EBLN-1 and -2, if expressed as is the case for present-day EBLN-1 and -2 in humans, might have conferred selective advantages to ancient primates by preventing bornavirus infection. In this regard, it is interesting to note that EBLNs have been observed in the genomes of primates, rodents, chiropterans, opossums, afrotherians, and fishes [1,2], where prevalence of bornavirus has not been reported. In contrast, birds, cattle, sheep, and horse, which do not harbor EBLNs, are known as the natural hosts of bornavirus [4,5,15]. It might be speculated that as EBLNs protected ancient primates from infection by primate bornavirus, this virus may have changed the host preference or become extinct, which may have promoted the relaxation of functional constraint on EBLNs after the divergence of Old World and New World monkeys.
In conclusion, no evidence for natural selection was identified for EBLN-1 to -4 after the divergence of Old World and New World monkeys. In the present study, we mainly focused on the functionality of EBLNs as proteins. However, considering that mRNA expression of EBLNs has been detected in various tissues [1,2], it is tempting to speculate that EBLNs might have been functional as RNAs. In addition, interactions identified between human EBLN-2 and cellular proteins [7] point to a possibility that human EBLN-2 might have been acquiring a new function. Further studies are needed to fully understand the functional significance of EBLNs integrated into the host genome.

Identification of ORFs in primate EBLN-1 to -4
The nucleotide sequences of EBLN-1 to -4, which were reported in Horie et al. (2010) [2], as well as their flanking 1,000 nt up-stream and down-stream each in primate genomes (human, chimpanzee, gorilla, orangutan, macaque, and marmoset) were retrieved from the Ensemble Genome Browser. To examine the entire coding capacity of EBLN-1 to -4, all ORFs with .50 codons overlapping with the original EBLN-1 to -4 were extracted. The ORFs whose first codon shared the reading frame with bornavirus N gene (isolate H1499; Accession no. AY374520) were named ORF1. The ORFs encoded on the same strand as ORF1 were named ORF2 and ORF3, when position 1 of the first codon corresponded to positions 2 and 3 of codons in bornavirus N gene, respectively. The ORFs encoded on the opposite strand of ORF1 were named ORF-1, ORF-2, and ORF-3, when position 1 of the first codon was complementary to positions 1, 2, and 3 of codons in bornavirus N gene, respectively. Some of ORF1, ORF2, ORF3, ORF-1, ORF-2, and ORF-3 were named ORF1a, ORF1b, and so on, when they were partial. Undetermined nucleotides in gorilla, orangutan, and marmoset were treated as gaps in the analysis. Transposons found in the ORFs were classified using Repbase [16].
Phylogenetic analysis of primate EBLN-1 to -4 Multiple alignment of nucleotide sequences for primate EBLN-1 to -4 was made by using the computer program Clustal W [17]. The general time reversible model (GTR) with the gamma distribution for the rate heterogeneity among sites (G) was selected as the optimum model of nucleotide substitution for primate EBLN-1 to -4 by MODELTEST [18] with PAUP (ver. 4.0). Phylogenetic tree for EBLN-1 to -4 was constructed by the neighbor-joining (NJ) [19], maximum likelihood (ML) [20], and Bayesian methods with GTR+G by using PAUP, PhyML (ver. 3.0) [21], and MrBayes (ver. 3.1) [22], respectively. The reliability for the NJ and ML trees was assessed using the bootstrap probability with 1,000 and 100 re-samplings, respectively. To construct the Bayesian tree, the Markov chain Monte Carlo chains were run for 1,000,000 generations with a burn-in of first 25,000 generations, and the phylogenetic tree was sampled every 1,000 generations. The credibility of the interior branch was assessed as the posterior probability.

Statistical analysis of natural selection operating on ORFs of primate EBLN-1 to -4
For each of EBLN-1 to -4, multiple alignments of nucleotide sequences for ORFs were made by using Clustal W. Natural selection operating over the entire ORFs was inferred by computing the d N /d S ratio between pairs of orthologous sequences using the Pamilo-Bianchi-Li method [23,24] with the pairwise deletion option in MEGA. The standard errors of d N and d S were estimated using the bootstrap method with 1,000 re-samplings, and the null hypothesis of no selection (d N = d S ) was tested by the Z-test [10]. Bonferroni correction was applied to account for multiple testing, where the significance level for individual tests was modified by considering the number of independent tests with the family-wise significance level set at p = 0.05. Natural selection operating on particular regions of ORFs was examined by the sliding window analysis using CRANN [25]. The d N /d S ratio between orthologous sequences was estimated using the Pamilo-Bianchi-Li method for each window (window size = 30 codons, step size = 1 codon). The null hypothesis of no selection (d N = d S ) was tested by the Z-test.
Natural selection was also examined at each branch of the phylogenetic trees for ORFs of EBLN-1 to -4. The topology of the species tree was used in the analysis, because EBLN-1 to -4 of all primates analyzed in the present study were considered to be otrthologous [2] and the topology within each cluster of EBLN-1 to -4 in the NJ, ML, and Bayesian trees was almost the same as that for the species tree of primates ( Figure S2). The d N /d S ratio was estimated for each branch by the maximum likelihood method using the codon substitution model in PAML (version 4.0) [26]. The equilibrium codon frequencies were treated as free parameters, and the d N /d S ratio was estimated under the free-ratio (selection) model and the branch specific (null) model. In the selection model, the d N /d S ratio was allowed to vary among branches, whereas in the null model, the d N /d S ratio was fixed to be 1 at specified branches. Since different results can be obtained depending on the initial d N /d S ratio in PAML [27], 0.4, 1, and 3.14 (and 15 in some cases) were used as the initial d N /d S ratio, and the results with the highest likelihood values were adopted as the final results.
The null hypothesis of no selection (d N /d S ratio = 1) for the specified branch was tested by the LRT, where twice the difference in the log-likelihood value was assumed to follow a x 2 distribution with the degree of freedom equal to the difference in the number of parameters estimated in the null and selection models. Since the LRT was conducted for a total of 108 branches in the phylogenetic trees for EBLN-1 to -4, the significance level for individual tests was set at p = 0.00046, which corresponded to the family-wise significance level of p = 0.05, using the Bonferroni correction for multiple testing.

Simulation
The probability that primate EBLN-1 has maintained ORF1 under the selective neutrality (no selection) after the divergence of Old World and New World monkeys was obtained by simulating the evolution of primate EBLN-1 using Seq-Gen [28]. The ancestral sequence of primate EBLN-1 (1,104nt) was inferred from the sequences of EBLN-1s for human, chimpanzee, gorilla, orangutan, and macaque using the Bayesian method implemented in PAML. Macaque EBLN-1, from which the Alu element was removed, was used as the outgroup to determine the position of the root for other sequences, where the ancestral sequence was inferred. The sequences of inferred ancestral primate EBLN-1 (1,104nt), consensus primate EBLN-1 (1,104nt) reported previously [3], and bornavirus N gene (1,113nt) (isolate H1499; Accession no.: AY374520) were evolved after removing termination codons with three evolutionary rates (1.0610 29 , 1.5610 29 , and 2.2610 29 per site per year [3,11,12]) and two divergence times of Old World and New World monkeys (44.2 MYA and 54.1 MYA [3,6]). The transition/transversion rate ratio was assumed to be 4. The number of premature termination codons in the simulated sequence was counted for 100,000 iterations.