Improved Integration Time Estimation of Endogenous Retroviruses with Phylogenetic Data

Background Endogenous retroviruses (ERVs) are genetic fossils of ancient retroviral integrations that remain in the genome of many organisms. Most loci are rendered non-functional by mutations, but several intact retroviral genes are known in mammalian genomes. Some have been adopted by the host species, while the beneficial roles of others remain unclear. Besides the obvious possible immunogenic impact from transcribing intact viral genes, endogenous retroviruses have also become an interesting and useful tool to study phylogenetic relationships. The determination of the integration time of these viruses has been based upon the assumption that both 5′ and 3′ Long Terminal Repeats (LTRs) sequences are identical at the time of integration, but evolve separately afterwards. Similar approaches have been using either a constant evolutionary rate or a range of rates for these viral loci, and only single species data. Here we show the advantages of using different approaches. Results We show that there are strong advantages in using multiple species data and state-of-the-art phylogenetic analysis. We incorporate both simple phylogenetic information and Monte Carlo Markov Chain (MCMC) methods to date the integrations of these viruses based on a relaxed molecular clock approach over a Bayesian phylogeny model and applied them to several selected ERV sequences in primates. These methods treat each ERV locus as having a distinct evolutionary rate for each LTR, and make use of consensual speciation time intervals between primates to calibrate the relaxed molecular clocks. Conclusions The use of a fixed rate produces results that vary considerably with ERV family and the actual evolutionary rate of the sequence, and should be avoided whenever multi-species phylogenetic data are available. For genome-wide studies, the simple phylogenetic approach constitutes a better alternative, while still being computationally feasible.


Introduction
Retroviral infections have been a constant on animal life for millions of years. Occasionally, some of these genetic parasites integrate into the germline as endogenous retroviruses (ERVs). Genetic footprints such as these constitute up to 8% of the human genome [1]. Many of these ERVs lay now dormant, after millions of years of genetic change, whereas some still retain protein coding capability and play roles in the host organism that range from adhesion promotion [2,3,4] to immune response modulation [5], while also being implied in diseases such as multiple sclerosis [6,7] and correlated with certain types of cancer [8].
Being originated from their extant counterparts, ERVs share the same genetic structure and organization. There are three major classes of ERVs -class I ERVs are similar to gammaretroviruses, class II ERVs are closer to beta and alpharetroviruses whereas class III ERVs are more related to spumaviruses [9]. At the genetic level, identifiable common structures such as the 59 LTR, PBS, Gag, Pro, Pol, Env, PPT and 39 LTR may or may not be present in an ERV locus [1,10,11]. The natural degeneracy of an ERV locus with neutral substitution rate and a divergence limit for nucleotide sequence recognition results in an upper limit for retroviral age that can be detected. Currently, retroviral sequences older than 250 million years cannot be found in today's genomes [12], although ERVs that are evolutionarily selected can leave their genetic footprint for longer than average, making them prime targets for detection.
Estimating the integration time makes use of the assumption that both LTRs of a retrovirus are identical at the time of infection. Once the retrovirus lodges itself in the germline, both LTRs evolve separately as if they were paralogs. This is a consequence of the particular replication cycle of the retrovirus, where both LTRs are copied from one and same template during a multistep complex process [13]. By taking into account the 39LTR-59LTR sequence divergence and empirical evolutionary rates for some ERV families, researchers have been calculating integration times based on the simple distance over rate formula. However, this method neglects several mechanisms of the ERV loci and the LTRs themselves. First, 39 LTR and 59 LTR have different evolutionary rates that depend on selective pressures on each end of the locus. Second, different species may have different evolutionary rates for homologous ERV loci. By using phylogenetic data when available, we hope to surpass these obstacles in obtaining more accurate estimations of integration times.
Assuming known speciation times, we can estimate ERV integration times by using LTR sequence divergence and both 39 LTR and 59 LTR rates (figure 1). If we take T1 as the known speciation time, corresponding to the time that each 39LTR and 59LTR take to coalesce into their common ancestors, and T2 as the unknown speciation time, for two species A and B we have that Integration time = Distance(59LTR239LTR)/(rate 59 +rate 39 ). Here, rate 59 and rate 39 is calculated as the average 59LTR and 39LTR evolutionary rates across branches (between species), respectively. By using this approach, we expect to improve the simple fixed rate method with phylogenetic corrections on the estimation.
We also estimated integration times using a computational approach based on markov-chain monte carlo simulations (MCMC) supported by a relaxed molecular clock model with dated tree nodes. This methodology requires a more thorough setup, more detailed in the materials and methods section. By comparing the results obtained from all methods we expect to draw conclusions as to the usefulness and drawbacks of these methods.

Results
After the initial sequence selection process, we conducted the research on ten endogenous retroviral loci, all with full LTR sequences, present at least in two primate species and whose phylogeny obey the primate evolutionary history.
Using the basic phylogenetic data to infer integration dates, we obtained point estimates of integration time for each endoretroviral sequence. We performed the analysis using both HKY and GTR substitution models when building phylogenies. MCMC estimation of node ages, however, provided us with confidence intervals of integration time (table 1).
Although the basic phylogenetic method gives us only a point estimate, that specific point in time is, in eight out of ten cases studied, within the confidence intervals obtained by the MCMC calculations. In order to assess how well the use of a single fixed rate of evolution for the entire endoretroviral sequence would fit the MCMC results, we used four estimated rates of evolution and two intervals of rates for human ERVs from the literature, redoing the calculations for integration time with the methodology depicted in figure 1 (table 2).
We also estimated 59 and 39 LTR substitution rates from different species pairs (table 3) showing a substantial variation in LTR substitution rates between the analyzed ERV families (figure 2). Here, the cutoff point for visual distinction of 25 Mya was used as an approximate date for the New World -Old World Monkey split [14]. This split easily indicates that the LTRs with the faster substitution rates are those of a more recent insertion time, namely the ERVK2, ERVK7 and ERVK9 along the Human-Chimp branch.

Discussion
It is clear that, when using a single rate for the whole ERV sequence for integration time calculations, several important factors are being omitted. The final estimation is highly dependent on the original assumption on how fast the endoretroviral sequence is evolving, and for old sequences estimations can vary up to 50% (see Table 2). It was also clear from our study that 59 and 39 LTRs have distinct evolutionary rates; that 39 rates are slightly higher than 59 rates and that overall rates varies greatly between ERV families (see figure 2 and Table 3). Thus, applying a single evolutionary rate to estimate the time of integration is rarely a good approximation when studying ERV sequences. Point estimates of integration dates are hard to find in the literature, except for the more recent ERV-K group. Most of them are based on fossil records of species separation, namely the New World/Old World monkeys split or the Hominid split. We compared available integration time estimates from the literature with those found by the methods described in this work.
Two works by Zanotto et al [15,16], estimated an average integration time of the ERV-K group in the human-chimpanzee cluster to 18,3 million years before present (MYBP). That estimate shrunk to 7.8 MYBP, when the analysis was constricted to only the ERV-K present in humans. The latter results are consistent with the findings of our work for the most recent ERV-K loci studied, ERV-K2 and ERV-K7. However, ERV-K loci derived from older lineages, such as the ERV-K3 and ERV-K9, imply a much older integration time. The variability in integration times and rates found within the ERV-K family may discourage a broad generalization on their properties.
Other estimates point out only rough time intervals of estimation. ERV3 is thought to be originated more than 30 million years ago [17], an assumption verified by our results that place the ERV3 integration around 42 million years ago. The ERVPB1 locus had been previously timed around 30 million years old based on PCR amplification from different primates [18], an estimate placing the integration of ERVPB1 at a more recent time than both of the phylogenetic LTR divergence (53,58-54,74 Mya) and the MCMC estimation (39,5-86,3 Mya). ERV-WE1, also known as syncytin-1, is assumed to have infected a Catarrhine ancestor 25-40 million years ago [19], although our study reveals a somewhat more recent integration. This is explained by the fact that LTR position for the rhesus macaque ERV-WE1 locus was coincident with a gap in genomic data (Jan. 2006 assembly) and therefore, we couldn't include it in our study. ERV-FRD, also known as syncytin-2, is thought to be much older, over 40 million years [20], and our results support that this endoretroviral integration is quite ancient. The estimated integration time of over 100 million years would suggest that a homologous ERV-FRD could be found on small mammals, but this is not the case. Even though mice possess their own syncytins, syncytin-A and syncytin-B, these do not share a common ancestor with the primate syncytins [21]. The old age of syncytin-2 may also be explained by an induced bias due to the fact that this locus has a slower evolutionary rate than any other studied ERV.
Comparing the basic phylogenetic method with the more computationally intensive MCMC method, we find that for recent (,40 million years) ERVs the predictions of both methods are quite similar. Problems arise for old (.40 million years) ERVs, where the predictions of the phylogenetic method tend to diverge from those of the MCMC with the increasing ERV age. The use of a fixed rate, however, produces results that vary considerably with ERV family and the actual evolutionary rate of the sequence, and should be avoided whenever multi-species phylogenetic data are available. The MCMC calculations can be quite consuming if a big amount of data is needed. For genome-wide studies, the simple phylogenetic approach may constitute a viable and faster alternative, while maintaining a certain level of accuracy.

Sequence mining
We selected several known human endogenous retroviruses and acquired their sequences from the UCSC Genome Browser [22], with the aid of a custom track designed to help the visualization of endoretroviral sequences. A DNA dot plot [23] was used to confirm the presence of long terminal repeats (LTRs). Human endogenous retroviral sequences were used as a template to detect homologous endoretroviral sequences in other primates, whenever possible. We cropped the LTRs and aligned them using the Clustal algorithm [24]. We conducted a phylogenetic analysis of every endoretroviral sequence by building phylogenetic trees of both 39 and 59 LTRs under several models. Endoretroviral sequences that presented

Phylogenetic analysis
Phylogenetic trees for the LTRs were built using the PhyML software, using HKY+G and GTR+G substitution models [25] under the maximum likelihood method, with default parameter values and 1000 bootstrap replicates. We used the mcmctree application in the PAML software package [26] to estimate node ages in the HKY trees, using a non-informative prior and empirical species split time intervals for calibrating the molecular clock model. Branch lengths were automatically extracted from the tree files using the newick tools 0.1 software package [27]. Nucleotide distance was calculated in MEGA 4.0 [28] using the maximum composite likelihood model.

Time of integration estimation
In order to estimate time of integration from basic phylogenetic data, we used the method described in figure 1. Independent 39 and 59 rates were estimated using the maximum composite likelihood model with gamma distributed rates among sites in MEGA 4.0. The gamma shape parameter was estimated for each dataset using the jModelTest substitution model selection functionality. Applying the values for the estimated genetic distances (D) and known speciation times (T) in the formula    figure 1 in order to obtain a final value for the integration time.
The evolutionary rate of the ancestral ERV sequence, corresponding to the branch prior to the last species node, was considered to be, as a simplification, an average of the evolutionary rates across the remaining branches. The HKY+G model was selected after a model fit analysis using the corrected Akaike Information Criteria test. Two of the datasets yielded the K80+G model as the best fit but, since the HKY+G model came in close second in both those cases, the latter model as used throughout the analysis in order to allow for a common framework. The GTR+G model was also included as the next best fit for all the datasets and to serve as a test for congruency of estimations when using a different well fitted substitution model.The MCMC estimation of node ages was performed with mcmctree. Each dataset's phylogeny was assumed to behave as a molecular clock system. Internal nodes of each phylogenetic tree were calibrated with confidence intervals pertaining to specia-cion events -the time intervals used were of 4-6 Mya for the human-chimpanzee node, 12-15 Mya for the human-orangutan node and 23-27 Mya for the human-rhesus macaque node. Node age estimation was performed in triplicate for each sequence set to validate results. Each MCMC chain ran for 75000 steps of which the last 25000 contributed with 250 samples for the estimation.   In blue, pairwise rates for loci estimated to be less than 25 million years old; in red, rates for loci estimated to be more than 25 million years old. The black dashed line represents identical rates between 59 and 39 LTRs. doi:10.1371/journal.pone.0014745.g002