The Paramecium Germline Genome Provides a Niche for Intragenic Parasitic DNA: Evolutionary Dynamics of Internal Eliminated Sequences

Insertions of parasitic DNA within coding sequences are usually deleterious and are generally counter-selected during evolution. Thanks to nuclear dimorphism, ciliates provide unique models to study the fate of such insertions. Their germline genome undergoes extensive rearrangements during development of a new somatic macronucleus from the germline micronucleus following sexual events. In Paramecium, these rearrangements include precise excision of unique-copy Internal Eliminated Sequences (IES) from the somatic DNA, requiring the activity of a domesticated piggyBac transposase, PiggyMac. We have sequenced Paramecium tetraurelia germline DNA, establishing a genome-wide catalogue of ∼45,000 IESs, in order to gain insight into their evolutionary origin and excision mechanism. We obtained direct evidence that PiggyMac is required for excision of all IESs. Homology with known P. tetraurelia Tc1/mariner transposons, described here, indicates that at least a fraction of IESs derive from these elements. Most IES insertions occurred before a recent whole-genome duplication that preceded diversification of the P. aurelia species complex, but IES invasion of the Paramecium genome appears to be an ongoing process. Once inserted, IESs decay rapidly by accumulation of deletions and point substitutions. Over 90% of the IESs are shorter than 150 bp and present a remarkable size distribution with a ∼10 bp periodicity, corresponding to the helical repeat of double-stranded DNA and suggesting DNA loop formation during assembly of a transpososome-like excision complex. IESs are equally frequent within and between coding sequences; however, excision is not 100% efficient and there is selective pressure against IES insertions, in particular within highly expressed genes. 
We discuss the possibility that ancient domestication of a piggyBac transposase favored subsequent propagation of transposons throughout the germline by allowing insertions in coding sequences, a fraction of the genome in which parasitic DNA is not usually tolerated.

• $\rho_1$: fraction of quartets with origin between WGD1 and the present.
• $\delta_{3,1}$: survival rate, during the period between WGD1 and the present, for IESs created before WGD2.
• $\delta_{2,1}$: survival rate, during the period between WGD1 and the present, for IESs created between WGD2 and WGD1.

Assumptions
• Independence of IES decay processes within quartets.
• Independence of IES decay processes across quartets.
• Negligible probability of two insertion events occurring at the same place after a WGD event.

Model
We treat the observable quartet type counts as the sum of the counts arising from each of three decay processes: one for IESs created before WGD2 that survived to WGD2 (g = 3); one for those created between WGD2 and WGD1 that survived to WGD1 (g = 2); and one for those created after WGD1 that survived to the present (g = 1). Under these assumptions, the probability $p_{g,q}(\delta)$ of observing each of the IES patterns q, given that an IES originated in generation g, is given in the following table:

The final row of the table is the case of IESs that were eliminated entirely by decay, and are censored observations. Thus, the relative frequency of the other patterns increases from the values in the table above by a factor of $1/(1 - p(q = 0000 \mid g))$.

Therefore, the relative frequency of the five observables as a function of $\rho_2$, $\rho_3$, $\delta_{3,2}$, $\delta_{3,1}$, and $\delta_{2,1}$ is:

$$p(q = 1010 \mid \rho, \delta) = \rho_3 \, \frac{4\,\delta_{3,2}^{2}\,\delta_{3,1}^{2}\,(1 - \delta_{3,1})^{2}}{1 - \left[(1 - \delta_{3,2})^{2} + 2\,\delta_{3,2}\,(1 - \delta_{3,2})\,(1 - \delta_{3,1})^{2} + \delta_{3,2}^{2}\,(1 - \delta_{3,1})^{4}\right]}$$

Given these expressions, we can write out a likelihood function for the observed data, given the underlying parameters:

The parameters in the model are not fully identified by the data. As a result, there are many maximum likelihood estimates (MLEs).

Model 4 is constrained with respect to the decay rates, as it assumes that there is a constant rate of decay over time, $\delta$. In this model, a unique set of parameter values gives the maximum value of the log-likelihood (−13.52). The justification for this assumption is provided by measurement of the same protein divergence between WGD2 and WGD1 as between WGD1 and the present (cf. Materials and Methods). The parameter values $\rho_1 = 0.16$, $\rho_2 = 0.69$, $\rho_3 = 0.15$ are consistent with a wave of insertions that peaked in the period between WGD2 and WGD1. The value for $\delta$ is the same as that obtained for $\delta_{3,1}$ in the less constrained models (see below).
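The closed-form expression for $p(q = 1010 \mid \rho, \delta)$ can be checked against a brute-force enumeration of the decay process for IESs created before WGD2. The following is a minimal sketch, not the authors' code. It assumes that quartet positions are ordered as the two WGD1 copies of one WGD2 branch followed by the two WGD1 copies of the other, and that "1010" stands for the class of patterns with exactly one surviving copy in each WGD1 pair (which is what the factor of 4 in the formula suggests). The function names are hypothetical.

```python
from itertools import product


def quartet_probs_g3(d32, d31):
    """Brute-force P(pattern | IES created before WGD2).

    d32: survival probability of one copy between WGD2 and WGD1.
    d31: survival probability of one copy between WGD1 and the present.
    All copies decay independently (the independence assumption within quartets).
    """
    probs = {}
    # State of the two WGD2 copies (A, B) at the time of WGD1.
    for alive_a, alive_b in product((0, 1), repeat=2):
        p_branches = (d32 if alive_a else 1 - d32) * (d32 if alive_b else 1 - d32)
        parents = (alive_a, alive_a, alive_b, alive_b)
        # State of the four present-day copies (A1, A2, B1, B2).
        for bits in product((0, 1), repeat=4):
            p = p_branches
            for parent, bit in zip(parents, bits):
                if parent:
                    p *= d31 if bit else 1 - d31
                elif bit:
                    p = 0.0  # a lost copy cannot reappear (no re-insertion)
            key = "".join(str(b) for b in bits)
            probs[key] = probs.get(key, 0.0) + p
    return probs


def p_1010_closed_form(rho3, d32, d31):
    """Closed-form relative frequency of the 1010 class, as in the text."""
    censored = ((1 - d32) ** 2
                + 2 * d32 * (1 - d32) * (1 - d31) ** 2
                + d32 ** 2 * (1 - d31) ** 4)
    return rho3 * 4 * d32 ** 2 * d31 ** 2 * (1 - d31) ** 2 / (1 - censored)


if __name__ == "__main__":
    # Illustrative values taken from one of the MLEs quoted in the text.
    rho3, d32, d31 = 0.15, 0.87, 0.91
    probs = quartet_probs_g3(d32, d31)
    one_per_pair = sum(probs[k] for k in ("1010", "1001", "0110", "0101"))
    brute_force = rho3 * one_per_pair / (1 - probs["0000"])
    print(brute_force, p_1010_closed_form(rho3, d32, d31))
```

Dividing by `1 - probs["0000"]` is the censoring correction described above: quartets in which every copy decayed are never observed, so the remaining patterns are renormalized.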
The three other models are constrained with respect to the time of IES insertion, and there is a continuum of parameter values that share the maximum value of the log-likelihood (−13.52), some of which involve quite different values of several of the parameters. For example, the same value of the likelihood results from $\rho_1 = 0.13$, $\rho_2 = 0.71$, $\rho_3 = 0.15$, $\delta_{2,1} = 0.90$, $\delta_{3,1} = 0.91$, and $\delta_{3,2} = 0.87$. These parameter values are quite different from those of some of the constrained models, but they fit the data just as well. In particular, there are MLEs involving a wide range of values for all parameters except $\delta_{3,1}$, which is well identified by the data to be in the vicinity of 0.91.
Fortunately, even without the assumption of a constant decay rate over time, we can rule out one hypothesis of interest, which is that all IES quartets arose before WGD2. Model 3 above corresponds to that hypothesis, and it is strongly rejected by the data. A likelihood ratio test comparing that hypothesis to Model 1, Model 2, or any of the other models that result in log(L) = −13.52 has $\chi^2_{df} = 43.48$, where df is equal to 1, 2, or 3, depending on which model is used as a comparison. These correspond to p-values of $2.1 \times 10^{-11}$, $1.8 \times 10^{-10}$, and $1.0 \times 10^{-9}$, respectively. Thus, we can quite confidently reject the hypothesis that all the IES creation events occurred before WGD2.
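The tail probabilities for the likelihood ratio statistic can be reproduced with the closed-form chi-square survival functions for 1, 2, and 3 degrees of freedom. The sketch below is an illustration using only the Python standard library; the function name `chi2_sf` is hypothetical. It prints both the ordinary upper-tail probability and half of it. The halved values round to the p-values quoted above, which would be consistent with a correction for parameters lying on the boundary of the parameter space (Model 3 fixes insertion fractions at zero); the text does not state which convention was used, so this reading is an assumption.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable.

    Closed forms are used, so only df = 1, 2, 3 are supported in this sketch.
    """
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return (math.erfc(math.sqrt(x / 2))
                + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))
    raise ValueError("only df = 1, 2, 3 are implemented in this sketch")


if __name__ == "__main__":
    statistic = 43.48  # likelihood ratio statistic quoted in the text
    for df in (1, 2, 3):
        upper_tail = chi2_sf(statistic, df)
        # Second column: halved tail (assumed boundary convention, see lead-in).
        print(f"df={df}  upper tail={upper_tail:.2e}  halved={upper_tail / 2:.2e}")
```

Under either convention the probabilities are below $10^{-8}$, so the conclusion that Model 3 is rejected does not depend on this choice.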