Massive bursts of transposable element activity in Drosophila

The evolutionary dynamics of transposable element (TE) insertions have been of continued interest since TE activity has important implications for genome evolution and adaptation. Here, we infer the transposition dynamics of TEs by comparing their abundance in natural D. melanogaster and D. simulans populations. Sequencing pools of more than 550 South African flies to at least 320-fold coverage, we determined the genome wide TE insertion frequencies in both species. We show that 46 (49%) TE families in D. melanogaster and 44 (47%) in D. simulans experienced a recent burst of activity. The bursts of activity affected different TE families in the two species. While in D. melanogaster retrotransposons predominated, DNA transposons showed higher activity levels in D. simulans. We propose that the observed TE dynamics are the outcome of the demographic history of the two species, with habitat expansion triggering a period of rapid evolution.


Introduction
The question of how the dynamics of transposable elements (TEs) develop over evolutionary time scales has not yet been resolved in a satisfactory manner. Rather, two competing models 1 are proposed: The equilibrium model assumes that genomes are in transposition-selection balance and TEs have a constant transposition rate. With negative selection counteracting the spread of transposable elements (Charlesworth and Langley, 1989;Petrov et al., 2003;Lockton et al., 2008;Petrov et al., 2011;González et al., 2009;Lee and Langley, 2010) the TE composition remains stable over time. By contrast, the non-equilibrium model rests on a variable TE activity over evolutionary time scales Blumenstiel et al., 2013;Le Rouzic et al., 2007;Bergman and Bensasson, 2007;Lerat et al., 2011;El Baidouri and Panaud, 2013). Consistent with this model, bursts of activity have been shown for some TE families (Choulet et al., 2010;Engels, 1992;Diez et al., 2014;Bailey et al., 2003). It is, however, not clear whether bursts are limited to a few TE families within a species, or if they are a general phenomenon of all, or at least the majority of TE families. The controversy about the dynamics of TEs has been particularly hard to resolve from single genomic sequences or population data of a single species, because both models, the equilibrium and the non-equilibrium model, lead to similar predictions. The negative correlation between the number of insertions of a TE family and the average population frequency provides a good example for this (Petrov et al., 2003;. Under the equilibrium model the low average population frequency of abundant TE families is caused by strong purifying selection (Petrov et al., 2003(Petrov et al., , 2011. The same observation, however, can be attributed to a recent burst of TE activity under the non-equilibrium model Bergman and Bensasson, 2007). While the pattern of genome-wide TE insertions and their frequency distribution in a single species are not sufficient to resolve this controversy, a comparison of the TE abundance in two closely related species can do so, since the two models lead to different predictions: assuming a constant TE activity since the split of two closely related species, as purported by the equilibrium model, a very similar TE abundance should result in both species (Petrov et al., 2003). Under a nonequilibrium model bursts of TE activity are independent after the species split, resulting in divergence in TE insertion patterns between the two species. In this study we investigated the TE content in natural D. melanogaster and D. simulans populations, two closely related species which diverged about 2-3 million years ago . Combining empirical TE insertion frequency estimates from Pool-Seq  with computer simulations of TE dynamics we identify for about 78 (84%) TE families a significant deviation from the equilibrium model. About 46 (49%) TE families in D. melanogaster and 44 (47%) families in D. simulans experienced a recent burst of activity. Interestingly, retrotransposon families had a higher rate of transposition bursts in D. melanogaster while DNA transposons were more active in D. simulans. We propose that the high rate of non-equilibrium dynamics may be the result of the recent colonization of these two species.

Results
We compared the TE abundance in natural populations of the two closely related species D.
melanogaster and D. simulans to determine the patterns of long-term transposition rates.
In particular, we used these data to distinguish between a continuous rate of transposition and variable rates, with periods of burst followed by low activity. The comparison of TE abundance in the two species has been complicated by markedly different qualities of the reference genomes and the associated TE annotations. To avoid that the higher quality of the D. melanogaster genome biases our results, we pursued the following strategies: (i) using an improved D. simulans reference assembly (Palmieri et al., 2014), (ii) restricting the TE abundance comparison to regions present in the assemblies of both species (iii) using the same de novo TE annotation pipeline in both species (see Material and Methods) (iv) and employing a TE calling method, which is independent of the presence of a TE insertion in the reference genome. From each species we analyzed isofemale lines collected 2013 in Kanonkop (South Africa). By sequencing pooled individuals (Pool-Seq)   we obtained an average coverage of at least 320-fold using Illumina paired end reads.
We estimated TE abundance from the corresponding average physical coverage of 145 at TE insertion sites, using PoPoolation TE .
A comparison of de novo annotated TEs in D. melanogaster with the reference annotation [FlyBase; v5.53; (Quesneville et al., 2005;Kaminker et al., 2002)], indicated that our pipeline has a high sensitivity as well as a high specificity (supplementary results 3.1). The high quality of our TE annotation is further supported by similar population frequency estimates (Spearman's rank correlation, r S = 0.82, p < 2.2e − 16) and insertion numbers (Spearman's rank correlation, r S = 0.81, p < 2.2e − 16; supplementary results 3.3) in D. melanogaster , between this and a previous study that used the reference annotation .
As final validation of our annotation pipeline we compared the genomic TE distribution in natural populations obtained from our pipeline to an independently acquired data set.    (13,754). A similar observation has been made previously using in situ hybridization data (Dowsett and Young, 1982;Aquadro et al., 1988;. The average population frequency of TE insertions is higher in D. simulans (0.199) than in D.  (Brookfield et al., 1982;Engels, 1992;] will be discussed elsewhere (Kofler et al.; in preparation).
To answer whether the equilibrium or non-equilibrium dynamics predominated in the evolution of TEs of D. simulans and D. melanogaster we require two pieces of information, first the age distribution of TE insertions and second, whether differences in insertion numbers between the two species significantly deviate from equilibrium expectations.
The age distribution is an important parameter describing the dynamics of transposable elements. A direct approach to determine the age of TE insertions is based on the number 4 of mutations after insertion (Bergman and Bensasson, 2007;Malik et al., 1999;Blumenstiel et al., 2002Blumenstiel et al., , 2013, but this method cannot be applied to Pool-Seq data. Nevertheless, the previously demonstrated strong correlation between sequence divergence of TEs and their frequency in a natural population  suggests that population frequencies of TE insertions are good age estimators: young insertions have a low population frequency, while old insertions tend to be fixed. We further scrutinized this relationship by reasoning that young TE insertions are more likely to be expressed. Using RNA-Seq data from D. simulans (Palmieri et al., 2014) we found a highly significant negative correlation (Spearman's rank correlation, r S = −0.34, p = 0.00024) between population frequency and expression intensity. Finally, we reasoned that fixed TE insertions are old and therefore more likely to be shared between species. Indeed, we found fixed TE insertions to be highly enriched for TE insertions shared between D. melanogaster and D. simulans (Fisher's exact test; p < 2.2e − 16; supplementary results 3.6). Thus these fixed TE insertions most likely predate the split between these two species about 2-3 million years ago . Overall, our analyses suggested that the population frequency of TE insertions provides a rough, but suitable estimator for the age of TE insertions.
We performed computer simulations to determine for each TE family whether the observed interspecific differences in TE insertions numbers between species (fig. 2) significantly deviate from expectations under an equilibrium model. Our simulations considered each TE family separately and relied on a fitness function in which fitness decreases exponentially with insertion number, a condition for obtaining stable equilibria (Charlesworth and Charlesworth, 1983). Given the strong influence of population size on TE dynamics (Lockton et al., 2008;Lynch and Conery, 2003) (supplementary fig. 2C), we used a population size ratio in our computer simulations that reflects the ratio of the population variation estimator π (π Dsim /π Dmel = 0.0113/0.0074 = 1.519; supplementary results 3.5). In about 50% (46/93) of the TE families the number of insertions deviated significantly from the expectations under an equilibrium model ( fig. 2). This result was robust with respect to the actual population size employed in the computer simulations (for N e > 10, 000; supplementary results 3.9).
Also when assuming an equal population size of the two species (e.g. (Nolte and Schlötterer, 2008)) substantial deviations from the equilibrium model were identified (supplementary fig.   5 3).
Based on the age distribution and the significance of the difference in insertion number between species, we addressed the question whether equilibrium or non-equilibrium dynamics of TEs predominate in the two species. TE families with a constant activity since the split of D. melanogaster and D. simulans, as purported by the equilibrium model, should have i) insertion numbers between the two species that do not significantly deviate from expectations under an equilibrium model, ii) both old and young insertions in both species and iii) roughly equal amounts of old insertions between the two species and iv) roughly equal amounts of young insertions between the two species. Only 15 (16%) of the TE families showed a TE distribution compatible with the equilibrium model (e.g.: I-element, BS; supplementary results 3.7; fig. 1). The remaining 78 (84%) families did not fit the expectations of the equilibrium model: 36 of them became only recently active (e.g.: rover, Juan), 5 were active only in one species (e.g.: Tom1, pogo), 12 were active only in the past (e.g.: invader4, BS3, INE-1) and 25 families had insertion numbers that differed significantly between the two species (e.g.: Burdock, jockey, hobo; supplementary results 3.7; fig. 1).
The large number of species specific TE activity patterns encouraged us to evaluate the distribution of TEs between two D. melanogaster populations from South Africa and Portugal. Consistent with the non-equilibrium model we observed substantial differences in TE abundance for two families (R1A1-element, gypsy2; supplementary results 3.3). This pattern is in agreement with previous observations (Biémont et al., 2003; suggesting that the TE composition of local Drosophila populations can differ markedly despite little differentiation among cosmopolitan D. melanogaster populations (Caracristi and Schlötterer, 2003).
Given this high incidence of bursts of TE activity, we used the frequency based age estimates to distinguish between bursts of novel TE families (probably introduced by horizontal transfer ); no old insertions) and reactivation of old TE insertions. Reactivated families have old insertions in both species and the insertion numbers in the two species deviate significantly from equilibrium expectation (mostly due to differences in young insertions). We note that a highly active family with a rapid decrease in activity in one species may result in the same signature, but we consider 6 this less likely since it is unclear whether high TE activity can be maintained for extended periods of time (Marí-Ordóñez et al., 2013;Khurana et al., 2011). We identified 28

Discussion
In this report, we provide the first genome-wide characterization of TE abundance in large population samples of the two closely related species D. simulans and D. melanogaster . Consistent with previous reports Lerat et al., 2011), we found considerable differences in TE composition between the two species.
The new high quality TE annotation in D. simulans enabled us to address the longstanding debate about the TE dynamics in natural populations: does the genomic distribution of TEs represent a (stable) transposition-selection equilibrium or is the TE composition changing over time due to bursts of transposition activities (non-equilibrium model) (Charlesworth and Langley, 1989;Petrov et al., 2003;Lockton et al., 2008;Petrov et al., 2011;González et al., 2009;Lee and Langley, 2010;Blumenstiel et al., 2013;Le Rouzic et al., 2007;Bergman and Bensasson, 2007;Lerat et al., 2011;El Baidouri and Panaud, 2013). With genome-wide TE insertion frequency data from populations of two species, we were able to distinguish these two hypotheses. Only 15 out of 93 TE families (16%) showed a pattern of TE abundance for which the equilibrium model could not be rejected. This is probably a conservative estimate since some episodes of TE burst may not be identified in our tests. For example, the distribution of I-element insertions seems to fit the equilibrium model, but it was suggested that active copies of the I-element were lost in D. melanogaster already some time ago, and that active copies only recently reinvaded extant populations (Bucheton et al., 1992). We note however, that another study did not find evidence for horizontal transfer of the I-element .
Our strong support for a non-equilibrium model with massive bursts of TE activity depends to a large extent on our ability to date TE insertions correctly. In absence of sequence divergence data we relied on population frequency as an indicator of age. In addition to several lines of evidence supporting this approximation, it is remarkable that our results agree very well with the published literature: INE-1, jockey2, helena, Cr1a (but not baggins) were mainly active in the distant past (> 3 mya) in D. melanogaster Bergman and Bensasson, 2007), while 17.6, Stalker4, rover, flea, copia, mdg3, roo, Transpac, opus, blood, 412, Burdock, diver, Tirant, Juan, Doc were mostly active recently (< 100, 000 ya) (Bergman and Bensasson, 2007;Engels, 1992). An example for some conflict with published evidence is the ZAM element, which was previously classified as an ancient element in D. melanogaster (Baldrich et al., 1997) but we identified only young insertions.
Interestingly, differences in TE composition are not only recognized in between species comparisons, but can be also detected between two D. melanogaster populations (supplementary results 3.3). These differences can not be explained by demographic events alone which should affect all TE families equally, whereas we only found marked differences for two TE families. Such differences in TE abundance between populations have also been observed in D. simulans (Biémont et al., 2003). In combination with the compelling evidence for bursts of TE activities in other species (Choulet et al., 2010;Engels, 1992;Diez et al., 2014;Bailey et al., 2003), we conclude that the TE composition in D. simulans and D. melanogaster is probably highly dynamic and changes quickly, such that even differences between populations can be detected. In fact at least 46 (49%) TE families in D. melanogaster and 44 (47%) in D. simulans have experienced a recent burst of transposition activity.
It is not clear what triggers bursts of TE activity. One interesting hypothesis suggests that habitat expansions could induce TE bursts Vieira and Biémont, 2004;Vieira et al., 2012). With TE insertions frequently contributing to adaptation to novel environments (Casacuberta and González, 2013;González et al., 2008),

TE bursts may be an important component of successful habitat expansions. Colonization
of new environments may trigger bursts of TE activity by two, not mutually exclusive mechanisms. Either stress associated with new environments disturbs the guarding system, such as piRNA, or the habitat expansion may bring species into contact, which never met before.
In combination with horizontal transfer of TEs, this could result in activity of a TE in a new host (Engels, 1992;Plasterk et al., 1999). One classic example for this scenario is the transfer of the P-element from D. willistoni to D. melanogaster, which invaded the territory of D. willistonii in South America (Engels, 1992). After the horizontal transfer, the P-element rapidly spread in D. melanogaster populations worldwide (Anxolabéhère et al., 1988). However, upon the activation of a single TE family, previously dormant families may become reactivated, as it has been noted during hybrid dysgenesis (Khurana et al., 2011;Petrov et al., 1995), where DNA damage mediated stress seems to be causative (Petrov et al., 1995;McClintock, 1984;Khurana et al., 2011). While we cannot pinpoint the most important mechanism for the burst of TE insertions, our data clearly indicate the central role of horizontal transfer. We identified transposition bursts of several novel TE families (28 in D. melanogaster and 35 in D. simulans), which are unlikely to be the outcome of de novo birth of TEs. Our results are fully compatible with previous work emphasizing the importance of horizontal transfer of TEs for the evolution of eukaryotic genomes Schaack et al., 2010;Gilbert et al., 2010). One key assumption for the habitat expansion mediated activity burst scenario is that the South African population does not represent an ancestral African population. In D. melanogaster , South African populations have been described to have high similarity to cosmopolitan ones (Pool and Aquadro, 2006). We also note that the level of polymorphism in our South African D. simulans population is more similar to a Portuguese population (F ST = 0.030; a detailed analysis will be published elsewhere) than to a central African population (F ST = 0.055; data from (Nolte and Schlötterer, 2008)). However, even very limited admixture from cosmopolitan flies would probably have been sufficient to introduce active TE families, resulting in a burst of TE  Malik et al., 1999). Furthermore, the integration of DNA transposons requires only transposase and no specific host factor, which makes these TEs potentially more successful invaders of diverged genomes Plasterk et al., 1999). The very recent out of Africa habitat expansion of D.
simulans (Capy and Gibert, 2004) about 100 years ago is therefore consistent with the predominance of bursts of DNA transposon. D. melanogaster , on the other hand, colonized already more than 10, 000 years ago (Stephan and Li, 2007), providing sufficient time for less invasive retrotransposons to colonize a new host. Furthermore, if D. melanogaster experienced a burst of DNA TEs shortly after the colonization, the host defense system (e.g.: the piRNA system (Levin and Moran, 2011)) may have matured to control the initially invading DNA TEs. Under this scenario, the genomic TE signature in D. simulans is expected to experience a transition from DNA TE bursts to retrotransposon bursts in the next couple of centuries.

Fly samples and sequencing
We collected 1,300 isofemale lines of D. simulans and 1,250 isofemale lines of D. melanogaster from Kanonkop (South Africa) in 2013. The lines were kept in the laboratory for 8 generations. We used a single female from 793 (554) isofemale D. simulans (D. melanogaster ) lines for pooling. Genomic DNA was extracted from pooled flies using a high salt extraction protocol (Miller et al., 1988) 1). For each library we selected for a narrow insert size, ranging from 260-340, using agarose gels. A total of five lanes 2x100bp paired-end reads were sequenced on a HiSeq2000 simulans.

Annotation of TE insertions
One of the requirements for estimating the abundance of TE insertions with PoPoolation TE ) is a reliable TE data base. A manually curated high-quality annotation of TE insertions has been generated for D.melanogaster (Kaminker et al., 2002;Quesneville et al., 2005) , whereas, to our knowledge, so far no TE annotation of comparable quality exists for D. simulans. To avoid any biases that may result from using TE annotations of different qualities we decided to de novo annotate TE insertions in both species with an identical pipeline. The reference sequence of D. melanogaster (v5.53) was obtained from FlyBase (http://flybase.org). We used the reference sequence published by (Palmieri et al., 2014) for D. simulans, as this assembly is of a higher quality than the previous available one (Begun et al., 2007) and of similar quality as a recently published one (Hu et al., 2013). We also obtained a library containing the consensus sequences of Drosophila TEs (transposon sequence set.embl; v9.42; (Quesneville et al., 2005) (Permal et al., 2012) (-gccalc -s -cutoff 200 -no is -nolow -norna -gff -u), yielding a raw annotation of TE insertions. The consensus sequences of several TE families contain microsatellites which may, as an artefact, be annotated as TE insertions (Permal et al., 2012;Quesneville et al., 2005). To account for this, we identified microsatellites in both reference genomes with SciRoKo 3.4 (Kofler et al., 2007) (required score 12; mismatch penalty 2; seed length 8; seed repeats 3; mismatches at once 3), converted the output into a 'gtf' file and removed TEs from the raw annotation that overlapped with a microsatellite over more than 30% of the length using bedtools (v2.17.0; intersectBed -a rawannotation.gff -b microsatellites.gff -v -f 0.3) (Quinlan and Hall, 2010). Overlapping TE insertions of the same family were merged and disjoint TE insertions of the same family were linked using an algorithm that, similar to dynamic programming, maximizes the score of the linked TE insertions (match − score = 1, mismatch − penalty = 0.5). We resolved overlapping TE insertions of different families by prioritizing the longest TE insertion and iteratively truncating the overlapping regions of the next longest insertions. Finally we filtered for TE insertions having a minimum length of 100 bp.

Estimating the abundance of TE insertions with PoPoolation TE
Estimating the abundance of TE insertions with PoPoolation TE requires paired end sequences from natural populations, a reference sequence, an annotation of TE sequences and a hierarchy of the TE sequences . We extracted the hierarchy of TE sequences from the database of consensus TE sequences (v9.42; see above). We extracted the sequences of the annotated TE insertions from the reference genomes into a distinct file and subsequently masked these TE sequences within the reference genome with the character 'N'.
We than concatenated the individual fasta records of (i) the consensus sequences of TE insertions, (ii) the TE sequences extracted from the reference genome and (iii) the repeat masked reference genome into a single file, which we call TE-merged-reference. Short read mapping software usually only allows for a few mismatches between read and reference genome which may lead to underestimate the abundance of some TE insertions, especially when the TE sequences are highly diverged . Such a high divergence between reads and the reference sequences may also result when the consensus sequences of TE families are derived from a different species. This could lead to underestimate the abundance of TE insertions in D. simulans when using consensus sequences that are mostly derived from D.
melanogaster . Therefore, we improved the sensitivity of our pipeline for D. simulans by including TE sequences extracted from the assemblies of Begun et al. (2007), Palmieri et al. (2014) and Hu et al. (2013) (using the same TE annotation pipeline as described above) into the TE-merged-reference of D. simulans.
We mapped 364 million PE fragments of D. melanogaster and 288 million PE fragments of D. simulans (see above) to the respective TE-merged-reference with bwa (v0.7.5a) ) using the bwa-sw algorithm   (supplementary tables 2, 3). We used 'samro' to restore the paired end information . We estimated the abundance of TE insertions with PoPoolation TE similarly as described in  using the following settings: identify-te-insertions.pl -te-hierarchy-level family, -min-count 3, -min-map-qual 15, -narrow-range 100; crosslink-te-sites.pl -min-dist 85, -max-dist 300; estimate-polymorphism.pl -te-hierarchy-level family, -min-map-qual 15; Subsequently we filtered for TE insertions located on the major chromosome arms (X, 2L, 2R, 3L, 3R, 4) and for TE insertions having a minimum physical coverage of 30 (physical coverage as defined here is the sum of paired end fragments that either confirm the presence or the absence of a TE insertion). An unbiased comparison of the abundance of TE insertions between different species requires similar physical coverages in all species. We therefore iteratively subsampled paired-end fragments and repeated TE identification with PoPoolation TE, until we obtained similar physical coverages in both species (supplementary   table 4).

Estimating nucleotide polymorphism
We estimated genome-wide levels of nucleotide diversity in the two natural populations using Pool-Seq data and PoPoolation (Kofler et al., 2011a). First, we aligned all reads to the respective reference genome (unmodified) with bwa aln (0.7.5a)   For estimating F ST between D. simulans populations we used PoPoolation2 (Kofler et al., 2011b) and the following parameters -min-qual 20 -window-size 1 -step-size 1 -pool-size 500 -suppress-noninformative -min-covered-fraction 1 -max-coverage 600,100,100 -min-count 4 -min-coverage 10; Only autosomes were used for calculating the F ST .

Expression level of transposable element families in D.simulans
To measure the expression level of different TE families in D. simulans we obtained previously published RNA-seq reads (Palmieri et al., 2014), derived from a mix of several developmental stages of D. simulans strain M252. The reads were trimmed with PoPoolation v1.2.2 (trim-fastq.pl) (Kofler et al., 2011a) using the following parameters: -fastq-type illumina, -quality-threshold 20, -min-length 40; We mapped the RNA-seq reads to a database consisting of the repeat masked reference genome of D. simulans (Palmieri et al., 2014) and the library of TE sequences derived from all three assemblies of D. simulans (see above).
Reads were mapped with bwa (v0.7.5a) ) using the bwa-sw algorithm . Unambiguously mapped reads (mapping quality ≥ 15) were filtered 14 with samtools (v0. 1.19)  ). Subsequently we counted the number of reads mapping to each TE family and normalized counts by the length of the consensus sequence (transposon sequence set.embl; v9.42; see above).

Orthologous regions between D. melanogaster and D. simulans
The assemblies of D. melanogaster and D. simulans are of different quality, for example varying in the amount of assembled heterochromatin. An unbiased analysis of TE abundance should therefore be restricted to genomic regions being present in the assemblies of both species. We identified these regions by aligning the genomes of D. melanogaster (v5.53) and D. simulans (Palmieri et al., 2014) with MUMmer (v3.23; nucmer) (Kurtz et al., 2004).
To avoid spurious alignments we masked all sequences derived from TEs in both reference genomes (see above) prior to the alignment. Coordinates were extracted with the 'showcoords' tool (Kurtz et al., 2004) and only alignments of the major chromosome arms (X, 2L, 2R, 3L, 3R, 4) were considered. Due to the masking of TE sequences these raw alignments contain a plenitude of gaps where the TE insertions actually causing the gaps, are not found in genomic regions that are present in the alignment. To mitigate this we linked these gaps by merging alignments not separated by more than 20.000bp in both species. This threshold of 20.000bp has been arbitrary chosen because only six of the masked regions in the repeat-masked genome of D. melanogaster have a size larger than 20.000bp. 2010) for windows of 1000kb. We excluded the X-chromosome and low recombining regions -including the entire chromosome 4 -from the analysis. In summary we performed our simulations with T = 68, 700, 000 TE insertions sites (distributed over the following genomic regions 2L:300000-16600000, 2R:3900000-20700000, 3L:900000-17400000, 3R:6600000-25700000) where every insertion site may either be empty or occupied. In our model, every TE insertion has a constant probability of transposing to a novel site v and excision events (u = 0) were not considered. Novel TEs were randomly inserted in any of the T insertion sites at any of the two haploid genomes. If an insertion site was already occupied the transposition event was ignored. For any individual i in a population of size N the fitness w i can be calculated as w i = 1 − xg t i , where g i is the number of TE insertions, x is the selective disadvantage of each insertion and t represents the interactions between the insertions (Charlesworth and Charlesworth, 1983). This is a model where all TE insertions exert a semi-dominant effect (Charlesworth and Charlesworth, 1983).

Modeling TE abundance in populations under an equilibrium model
Per default we used x = 0.0004 and t = 1.3 in our simulations. We furthermore used fecundity selection, where any individual has a probability of mating p i that linearly scales with fitness w i (p i = w j /Nw;w is the average fitness; after Le Rouzic et al. (2007)).
We simulated evolving populations with non-overlapping generations, proceeding at every generation in the following order: First N random pairs were picked according to the mating probability p i , where selfing was excluded. Second, each parent contributed a single gamete to the offspring wherein crossing over events were introduced according to the specified recombination rate (see above). Third, fitness of the offspring w i was calculated from the abundance of TE insertions in the resulting genome of the offspring. And fourth, transposition events were introduced according to the transposition rate v. Note that the novel TE insertions will only contribute to fitness in the next generation. This could for example be interpreted as TE activity in the germline which will mostly also only effect the next generation (i.e.: the offspring). In all simulations, we performed forward simulations for 10, 000 generations. We noted that if a stable equilibrium could be reached (e.g.: no increase in the number of fixed insertions), it took less than 5, 000 generations. Conservatively we used 10, 000 generations in our simulations. To match the analysis of natural populations we also sampled 145 haploid genomes after the 10, 000 generations and required a minimum count of 3 to identify a transposable element (see above).

Constant population size
In order to estimate the expected variance in TE copy number under an equilibrium model and an constant population size, we performed forward simulations for populations of N = 10.000 diploid individuals. We performed 10,427 individual forward simulations with transpositions rates randomly sampled from a uniform distribution between v = 0.0 − 0.003.
These simulations required approximately 10, 000 CPU hours. Different TE families may have markedly different transposition rates (Charlesworth and Langley, 1989) which will result in different equilibrium copy numbers. We therefore identified for every TE family (j) the most likely transposition rate v that maximizes the probability of observing both the TE copy number of D. melanogaster (c m j ) and of D. simulans (c s j ). To do this, we grouped the simulation results based on the transposition rate v into i overlapping windows (W i ∈ W ) with a window size of 10 −4 and a step size of 10 −5 and fitted, for every window, a normal distribution to the data (N i (µ i , σ 2 i ) with mean µ i and standard deviation σ 2 i ). The probability that a given number of TE insertions (c) can be explained by the transposition rate of window W i is than given by P (c|W i ) = 1 − P (µ i − |µ i − c| < x < µ i + |µ i − c|) which can be easily computed from N i .
Next we identified for every TE family (j) the window (W j max ) that maximizes the probability of observing c s j and c m j as W The corresponding transposition rate of this window will also be the maximum likelihood estimate of v. Finally the probability of observing both c s j and c m j with a constant transposition rate as found in window W j max can be computed as P (c m j , c s j |W j max ) = P (c m j |W j max )P (c s j |W j max ). We tested every TE family for significance using Bonferroni correction to account for multiple testings.

Varying population size
In order to include demography into our model of TE dynamics we estimated differences in effective population sizes by comparing the level of nucleotide polymorphism in D.
melanogaster and D. simulans. We found that D. simulans has a 1.519 higher effective population size than D. melanogaster . Accordingly, we performed forward simulations with two different population sizes where the larger population (N = 10.000) represents D. simulans and the smaller population (N = 6, 583; ≈ 10000/1.519) represents D. melanogaster .
Differences in TE insertions between these two species were assessed as described above.
The only difference was that, for every window (i) we fitted two separate normal distriubtions to the data, one for D. melanogaster (N m i (µ m,i , σ 2 m,i )) and one for D. simulans (N s i (µ s,i , σ 2 s,i )). The probability that a given number of TE insertions in D. melanogaster (c m ) can be explained by the transposition rate of the given window (W i ) can be calculated as , and accordingly, the probability that the number of TE insertions in D. simulans can be explained by the transposition rate in the same window as P (c s |W i ) = 1 − P ((µ s,i − |µ s,i − c| < x < µ s,i + |µ s,i − c|)|N s i ). Finally, the maximum likelihood window and the probability of observing both TE counts with the window-specific transposition rate were computed as described above. Again, we used Bonfferoni correction to account for multiple testing.

Data availability
The short reads (European Nucleotide Archive; http://www.ebi.ac.uk/ena; PRJEB6673) and the TE annotations (https://code.google.com/p/popoolationte/wiki/pdms) are publicly available. All scripts and the entire protocol used for this work ar also available (https://code.google.com/p/popoolationte/wiki/pdms) Figure 1: Distribution of TE insertions in a natural population of D. melanogaster (dm) and of D. simulans (ds). The TE distribution (outer graph) is compared to the recombination rate (middle graph) and the nucleotide polymorphism (Θ π , yellow inner graph). TE abundance and recombination rate are shown for windows of 500kb, whereas the nucleotide diversity is shown for windows of 100kb. For overlapping TE insertions (white) no estimates of population frequencies could be obtained. The relationship between the reference genomes is shown in the inside. Note, the inversion on chromosome 3R (Sturtevant, 1921)    Lane Barcode BGI-91a BGI-92a BGI-92b BGI-92c BGI-93a CGATGTAT Dsim-r1 Dsim-r1 Dsim-r1 Dsim-r1 Dsim-r1 TGACCAAT Dsim-r2 Dsim-r2 Dsim-r2 Dsim-r2 Dsim-r2 GCCAATAT Dmel-r1 Dmel-r1 Dmel-r1 Dmel-r1 Dmel-r1 CTTGTAAT Dmel-r2 Dmel-r2 Dmel-r2 Dmel-r2 Dmel-r2 Table 2: Mapping statistics for D. melanogaster ; All numbers are counts of paired end (pe) fragments; both same chr.: both reads mapped to the same reference chromosome; both same TE: both reads mapped to the same TE; on different chr.: reads mapped to different reference chromosomes; TE inserts: one read maps to a TE and the other to a reference chromosome; fwd. insertions: forward insertions; rev. insertions: reverse insertions for details see

Sensitivity and specificity of the TE annotation pipeline
To test the performance of our pipeline for the de novo annotation of TE insertions we used the reference annotation of D. melanogaster (FlyBase v5.53) as 'gold standard' and asked whether our pipeline reproduces this reference annotation. We excluded peri-centromeric regions that have, so far, not been annotated for TE insertions (2R:>22,420,241bp, 2L:<387,345bp, 3L:>23,825,333bp; Casey Bergman personal communication). We found that our de novo annotation pipeline has a high sensitivity and specificity both at the nucleotide level and the level of individual TE insertions (table 6).

Quality control for species pools
We used a very large number of individuals (> 500) from both species to establish isofemale lines and subsequently the pools used for this study. Since D. melanogaster and D. simulans are phenotypically similar, two different people checked each isofemale line. Since, it is possible that an error occurred in the species identification, we decided to additional check the sequenced pools. We compiled a set of 9,491 SNPs on chromosome 4 that are fixed for different alleles in the two species (R. Tobler, pers. communication). Using these SNPs we found that 0.042% of the base calls in the D. melanogaster library are identical to the allele fixed in D. simulans, which is close to the fraction of sequencing errors in this library (0.035%). Similarly for the D. simulans library we found that 0.014% of the base calls are identical to the allele fixed in D. melanogaster , which is again similar to the level of sequencing errors in this library (0.018%). The level of sequencing errors was estimated as the fraction of base calls at these 9,491 SNPs that are neither identical to the allele fixed in D. melanogaster nor to the allele fixed in D. simulans. We therefore conclude that each of the pools of individuals was derived from a single species only.
3.3 Reproducibility of results of  We also tested whether estimates of TE abundance generated with our de novo TE annotation match previously published results, that were generated with the reference annotation .   We note that some deviation in the TE abundance between these two samples are expected . This good agreement between estimates of TE abundance -despite different annotations, different geographic origins of the populations, different read length (here 100 vs 74), different library preparation methods -suggests that our approach yields highly reliable estimates of TE abundance. In agreement with this, the reliability of PoPoolation TE was recently also confirmed by a simulation study (Zhuang et al., 2014). However, some TE families show marked differences in the numbers of TE insertions between Portugal and South Africa ( fig. 4). While the lack of P-element insertions in the population from Portugal can simply be explained by the fact that P-elements were not considered in the study of , the higher copy numbers of R1A1-elements (South Africa 746; Portugal 11) in South Africa and of gypsy2 (South Africa 14; Portugal 50) elements in Portugal, may be due to different activities of these TE families in the two populations. a given insertion in a random genome). We did not include TE families for which  and our study, did not find a single insertion (osvaldo, gandalf, telemac, bilbo), as inclusion of such families may lead to artificially inflated correlations. We also did not include P-element and mariner insertions, as abundance of these two families were not directly estimated by . Overall we found a striking correlation between TE abundance estimated in this study and the study of , both for D. melanogaster (Spearman's rank correlation; r S = 0.85, p = 3.6e − 9) and D. simulans (Spearman's rank correlation; r S = 0.62, p = 0.0002). We note that absolute insertion numbers cannot be compared directly, as  excluded pericentromeric regions and provided the TE abundance per diploid genome, whereas we only analyzed regions being present in the assemblies of both species and provided the TE abundance per haploid genome.   and by this study.

Nucleotide polymorphism
We estimated the nucleotide polymorphism in the natural D. melanogaster and D. simulans populations with a sliding window approach, using non-overlapping windows of 1 kb (see material and methods in main manuscript). Subsequently we (i) filtered for regions being present in both assemblies, (ii) excluded regions with low recombination rates (< 1 cM/M b) in D. melanogaster and (iii) excluded windows overlapping with an exon (Table 8). Based on the level of polymorphism in the autosomes we estimate that the N e of D. simulans is approximately 1.519 (= 0.011285/0.007429) times higher than the N e of D. melanogaster .

Age of fixed TE insertions
Reasoning that fixed TE insertions are mostly old, we asked whether fixed insertions are enriched for insertions shared between D. melanogaster and D. simulans. Shared insertions should mostly predate the split of these two species, which occurred approximately 2-3 million years ago . To do this, we first generated a set of TE insertions that may potentially be shared between these two species by reciprocally aligning 1000bp regions flanking the TE insertion, both at the 5' and the 3' end, to the reference genomes of D. melanogaster and D. simulans using bwa-sw (v0.7.5a) . We only retained TE insertions (i) where the flanking regions could be that are potentially shared between the two species, i.e.: insertions in non-repetitive regions and insertions in regions that are present in the assemblies of both species (table 9). Actually shared insertions where subsequently identified by scanning for TE insertions of a given family having, in both species, insertion positions within the boundaries of these flanking regions. Note that this procedure allows for some degree of uncertainty in the exact insertion position of the TEs (as for example advisable when using PoPoolation TE). We generated an additional data set excluding the P-element, which has a strong insertion bias (Spradling et al., 2011) that may potentially bias our results. To ensure that any significant enrichment is not solely based on INE-1, a very old TE family that has not been active for >3 million years Sackton et al., 2009;, we also generated a data set excluding INE-1 insertions (table 9).
We found that, for all three data sets (  3).

only recent activity in at least one species; non-equilibrium
To this group we assign families that are only active since very recently in at least one species. No past activity, either in one or in both species, could be detected. With respect to supplementary fig. 3 this translates to families having just blue insertions (young: low population frequency) in at least one species.

inactive in at least one species; non-equilibrium
To this group we assign families that are inactive in at least one species. Allowing for some small margin of methodological error (PoPoolation TE may, due to false absence reads, occasionally underestimate the population frequency of TE insertions ), this translates to families having mostly red (old: fixed) insertions in at least one species (supplementary fig. 3).
With respect to supplementary fig. 3 this translates to families having red and blue insertions in the two species, but with significantly different copy numbers.

similar activity in both species; equilibrium
To this group we assign families that were active for some time (having old and young insertions) and that furthermore have copy numbers in the two species not significantly deviating from expectations under an equilibrium model. This translates to families having similar amounts of blue and red insertions in both species (supplementary fig. 3).

TE families showing signatures of recent bursts
Using the approximation suggested in the manuscript -TE insertions having low population frequency are mostly young whereas fixed insertions are mostly old -we can roughly distinguish between two different types of bursts of TE activity.