Transposon Variants and Their Effects on Gene Expression in Arabidopsis

Transposable elements (TEs) make up the majority of many plant genomes. Their transcription and transposition is controlled through siRNAs and epigenetic marks including DNA methylation. To dissect the interplay of siRNA–mediated regulation and TE evolution, and to examine how TE differences affect nearby gene expression, we investigated genome-wide differences in TEs, siRNAs, and gene expression among three Arabidopsis thaliana accessions. Both TE sequence polymorphisms and presence of linked TEs are positively correlated with intraspecific variation in gene expression. The expression of genes within 2 kb of conserved TEs is more stable than that of genes next to variant TEs harboring sequence polymorphisms. Polymorphism levels of TEs and closely linked adjacent genes are positively correlated as well. We also investigated the distribution of 24-nt-long siRNAs, which mediate TE repression. TEs targeted by uniquely mapping siRNAs are on average farther from coding genes, apparently because they more strongly suppress expression of adjacent genes. Furthermore, siRNAs, and especially uniquely mapping siRNAs, are enriched in TE regions missing in other accessions. Thus, targeting by uniquely mapping siRNAs appears to promote sequence deletions in TEs. Overall, our work indicates that siRNA–targeting of TEs may influence removal of sequences from the genome and hence evolution of gene expression in plants.


Introduction
While transposable elements (TEs) constitute a large fraction of plant, animal and human genomes [1][2][3], their contribution to genome size can change rapidly during evolutionary time. In some taxa, TEs have been responsible for two-fold differences in genome size that arose over a few million years or less. These rapid fluctuations, which may be due to TEs being either more active or more efficiently deleted in certain species, indicate that control of TEs can differ greatly between closely related plant species [4][5][6][7]. The balance between TE transpositions and selection against TEs is influenced by factors ranging from mating system to silencing by short interfering RNAs (siRNAs) and chromatin modification. Therefore the control of TE activity and the removal of transposed copies can be considered key factors in the evolution of genomes.
TEs are often regarded as genomic parasites due to the potentially detrimental effects of insertional inactivation of genes and ectopic recombination of DNA [8]. Twenty-four nt long siRNAs are associated with most TEs as part of a 'double-lock' mechanism of siRNA-mediated DNA methylation that controls transposition via transcriptional repression, with a reinforcement loop between DNA methylation, histone methylation and siRNAs [reviewed in 9]. siRNAs are a robust proxy for DNA methylation at TEs, with unmethylated TEs generally lacking matching 24 nt siRNAs [10][11][12][13]. Most plant TEs have cytosine methylation at CG, CHG and CHH sites, but a quarter is unmethylated and a further 15% have atypical methylation patterns. In the TE-dense heterochromatin, DNA methylation can spread about 500 bp into neighboring unmethylated TEs [13]. In the euchromatin, methylation spreads from TEs to approximately 200 bp beyond the siRNA target sites [13], consistent with the effect of siRNAs on expression of proximal genes dissipating by 400 bp [14]. siRNAtargeted, methylated TEs are, on average, located farther away from expressed genes than TEs that are not strongly methylated or associated with siRNAs [13,15]. As expected from this correlation, siRNA-targeted TEs have more effects on nearby gene expression than those without [14,15].
Most poorly methylated TEs are short and have few CG dinucleotides [13]. This indicates a progression over evolutionary time from TEs that are active and targeted by siRNA-mediated DNA methylation, to inactive, degenerate relics that have changed through deletions and nucleotide substitutions initiated by deamination of methylated cytosines. These inactive TEs are then no longer targeted by siRNA-mediated DNA methylation.
Presumably because of interference with cis-regulatory elements, Arabidopsis TEs reduce the average expression levels of adjacent genes, although the distance over which these effects are noticeable varies between A. thaliana and A. lyrata [14]. Differences in TEs next to genes contribute to the divergence of gene expression levels between orthologs in these closely related species [14], and gene expression is negatively correlated with the number of nearby siRNA-targeted, methylated TEs [15].
In the selfing species A. thaliana, TEs account for only a fifth of the genome [7,13,16], making it relatively depauperate of TEs. Given that the A. thaliana genome is small relative to other members of the family and that its close relative A. lyrata, an outcrosser, contains approximately three times as many TEs [14], deletion of TEs in A. thaliana is likely an ongoing, active process. In accordance with this hypothesis, intraspecific polymorphisms and deletions in A. thaliana are disproportionately located within TEs and, to a lesser extent, intergenic regions [17][18][19].
A reference-guided assembly approach has been applied to accurately characterize complex sequence variation in several A. thaliana accessions [19]. Here, we exploit this information to examine TE variants and their effect on the expression of nearby genes in three divergent accessions. We report that TEs are more likely to be located in polymorphic regions of the genome. Where TEs are present in less polymorphic regions, they also tend to be less polymorphic themselves. Although polymorphic TE variants are less abundantly targeted by siRNAs, uniquely mapping siRNAs targeting polymorphic TE variants are strongly correlated with the TE regions that vary between accessions. These findings suggest a link between the ability to tolerate TE insertions, siRNAmediated silencing and purging of TEs by deletion.

TE variation across the genome
We annotated the sets of genes and TEs in three A. thaliana accessions: Col-0, Bur-0 and C24 [19,20]. For reference accession Col-0, we used the TAIR9 annotation of TEs and protein-coding genes. Excluding centromeric sequences, 21,913 full-length and degenerate TEs and 26,541 genes were considered further. We built genome templates of Bur-0 and C24 from re-sequencing data using the SHORE pipeline [21]. The reference coordinates of TEs and genes were projected onto these genome templates, and variation in TEs and genes was determined based on single nucleotide polymorphisms (SNPs), 1 to 3 bp insertions/deletions (indels) and larger deletions of 4 to 11,464 bp (median 30 bp, mean 113 bp). Larger insertions were not included because of the high false-negative rate [17].
Comparison of polymorphism densities confirmed that coding regions were relatively depauperate of SNPs, indels and large deletions compared to intergenic regions and TEs (binomial test, p[Coding Regions/Intergenic Region] = 0 and p[Coding Regions/TE] = 0 for SNPs, indels or large deletions). Large deletions were significantly over-represented in TEs compared to intergenic regions, while SNPs and indels were not ( Figure S1a; binomial test, p[TE/Intergenic Region] = 0 for large deletions). Over 6% of reference TEs differed by at least 10% of total length in each of the two accessions, Bur-0 and C24, compared to Col-0 ( Figure 1a and Figure S2). Almost all of this variation, 93%, was due to large deletions ( Figure S1b; for distribution of large deletion sizes see Figure S1c). We defined TEs with at least 10% variation by length (SNPs, indels and larger deletions combined), but not completely missing in Bur-0 or C24, as TE variants or VarTEs (please also see Figure S3 for abbreviation definitions). Close to 40% of VarTEs were shared between Bur-0 and C24 ( Figure S4a).
TE density is highest in and next to the centromeres, where there are few genes. The fraction of VarTEs and the average level of TE variation were higher in the pericentromeric regions than on the gene-dense chromosome arms (Figure 1b; Mann-Whitney U [MWU] test, p,2610 216 for Col-0 versus Bur-0/C24, Table  S1 and Figures S5 and S6). To examine whether gene proximity biases TE variation across the chromosomes, we calculated the distance between TEs and protein-coding genes for Col-0. TEs were separated into two subsets: TEs within 2 kb of any gene, subsequently called proximal TEs, and TEs at least 2 kb away from the closest gene, called distal TEs. Distal TEs were on average more variable than proximal TEs (Figure 1c; Figures S7 and S8; MWU p[Col-0/Bur-0] = 0.001, p[Col-0/C24],6610 25 ). Proximity to protein-coding genes may therefore influence TE variation, consistent with TEs closer to genes likely being under stronger selective constraint [15,22].
The correlation between TE variation and proximity to genes was compared among TE superfamilies [23,24]. For noncentromeric TEs, LTR retrotransposons were more distal from genes, while no significant difference in distance to genes was observed for other TE superfamilies (Table S2). However, for proximal TEs there were differences among TE superfamilies in distance to genes and, as expected, TE superfamilies that are closer to genes (e.g. CACTA, MITE) were less variable than superfamilies located farther away from genes, e.g. non-LTR retrotransposons (Table S2).
To investigate the link between TE and proximal gene variation, we examined whether TE variation and location correlated with the polymorphism level of neighboring genes. We used the small-scale mutations to calculate the polymorphism level of non-centromeric genes. For each accession, genes were separated into two subsets; TE+ genes included genes within 2 kb of a TE and genes with TEs anywhere within the transcribed region, while TE-genes were at least 2 kb from the closest TE (Table S3). To be conservative, any TEs in Bur-0 or C24 with predicted deletions of at least 10% of the reference length were annotated as deleted. TE+ genes were on average more polymorphic than TE2 genes in each accession (Figure 2a; MWU p,2610 216 for Col-0, Bur-0 and C24). The same analysis was repeated for 80 resequenced A. thaliana accessions [17]; we could confirm the correlations observed with Bur-0 and C24 in these accessions.
Since polymorphism levels vary enormously among gene families, we further investigated whether there is a correlation of TE proximity with gene family using small-scale mutations from

Author Summary
Transposable elements (TEs) are selfish DNA sequences. Together with their immobilized derivatives, they account for a large fraction of eukaryotic genomes. TEs can affect nearby gene activity, either directly by disrupting regulatory sequences or indirectly through the host mechanisms used to prevent TE proliferation. A comparison of Arabidopsis thaliana genomes reveals rapid TE degeneration. We asked what drives TE degeneration and how often TE variation affects nearby gene expression. To answer these questions, we studied the interplay between TEs, DNA sequence variation, and short interfering RNAs (siRNAs) in three A. thaliana strains. We find sequence variation in genes and adjacent TEs to be correlated, from which we conclude either that TEs insert more often near polymorphic genes or that TEs next to polymorphic genes are less efficiently purged from the genome. We also noticed that processes that cause deletions within TEs and ones that silence TEs appear to be linked, because siRNA targeting is a predictor of sequence loss in accessions. Our work provides insight into the contribution of TEs to gene expression plasticity, and it links TE silencing mechanisms to the evolution of TE variation between genomes, thereby linking TE silencing mechanisms to expression plasticity.
To further examine the effects of TE variants on proximal genes, we divided TE+ genes into two subsets: genes where flanking TEs were ,10% variant (Invariant TEs: InvTE) among the three accessions (InvTE+ genes), and genes where at least one flanking TE showed $10% sequence (VarTE) variation between accessions (VarTE+ genes; Table S3). Three quarters of VarTE+ genes were shared in comparisons between Col-0 and Bur-0 or Col-0 and C24 ( Figure S4b). The VarTE+ genes were on average more polymorphic than InvTE+ genes (Figure 2b; MWU p = 0.005), also in the 80 accessions dataset [17]. We conclude that TEs close to genes are less polymorphic, while genes close to polymorphic TEs are themselves more polymorphic.
A correlation between polymorphism levels of TEs and nearby genes is insufficient to address whether this is a direct link as opposed to high directional selection pressure on the genomic region in general. To address this question, we therefore compared the polymorphism level of TEs, the flanking regions and nearby genes. TEs in highly polymorphic regions are themselves more polymorphic than TEs in regions of low divergence ( Figure S11a; binomial test, p = 0), with the exception that TEs in highly polymorphic regions with nearby lowly polymorphic genes show a similar level of divergence as TEs in regions of low polymorphism with no coding genes. Moreover, TEs in gene-free regions show significantly higher divergence than TEs within 4 kb of a gene, especially if those genes are less polymorphic. TEs are generally more polymorphic than their flanking sequences (binomial test, p = 0), with the exception of TEs in highly polymorphic regions with lowly polymorphic gene. The results for large deletions ( Figure S11b) are consistent with our observation from Figure S1 that large deletions are over-represented in TEs compared to intergenic regions. Notably, there is no significant difference in the level of small-scale mutations between TEs and flanking regions ( Figure S11c). Taken together, TE variation through large deletions shows a positive correlation with flanking region polymorphism level, but is also strongly influenced by the conservation and presence/absence of nearby genes. The frequency of large deletions is however generally higher in TEs than in the flanking regions, indicating positive selection for large deletions within TEs. TEs, siRNAs, and their effects on expression of adjacent genes Genes that are close to TEs (TE+ genes) tend to have a lower expression average than TE2 genes in the Col-0 reference accession [15]. We set out to determine whether this was true for the accessions studied here as well. Gene expression was measured using Affymetrix tiling arrays and RNA extracted from floral tissue of each accession. We considered presence/absence of TEs in the flanking regions of genes, taking into account the number of linked TE insertions and the distance from each gene to the closest TE. We confirmed the reported pattern for Col-0 [15], and found that it applies to Bur-0 and C24 as well. In all three accessions, genes with proximal TEs (TE+ genes) were on average expressed at lower levels than those without proximal TEs (TE2 genes; Figure  Since TE superfamilies may have different effects on proximal genes, we examined gene expression according to the TE superfamily of the closest proximal TE. TE+ genes are expressed differentially depending on the TE superfamily of the proximal TE. TE+ genes with DNA transposons are on average expressed at a higher level compared to TE+ genes surrounded by retrotransposons ( Figure S12; MWU, p = 0.02 for Col-0, Bur-0 and C24). However, this is solely due to the higher expression level of genes proximal to CACTA elements. Indeed, we did not find evidence for CACTA TEs having any effect on gene expression ( Figure S12, MWU, p(CACTA TE+ genes/TE2 genes) = 0.7, 0.6 and 0.8 for Col-0, Bur-0 and C24, respectively), which may explain why they are on average closer to genes than TEs from other families. Within the retrotransposons, LTR retrotransposons are younger on average than non-LTR retrotransposons and have a greater suppressive effect on proximal genes (Table S2; [25]). Therefore TE superfamilies can differ considerably in their effects on proximal genes.
TEs suppress the expression of neighboring genes at least partially through DNA methylation, which in turn is linked to 24nt long siRNAs [12,15,22,26,27]. To investigate the influence of siRNAs on TE silencing, we sequenced siRNAs from mixed inflorescence tissue (shoot meristem plus flowers, stages 1-14) of each accession and mapped the reads to all possible positions of the respective genomes without any mismatches. As expected from previous work, the density of siRNAs over TEs was about four times higher than the genome average (Table S4; Figure S13).
We have reported before that siRNA-targeted TEs are more effective in suppressing expression of neighboring genes than are non-siRNA-targeted TEs, and that they are farther from genes [15]. We determined whether this held true in the current, more comprehensive dataset. If at least one 24-nt siRNA mapped to a TE it was labeled as siRNA+ (Table S5). siRNA+ and siRNA2 TEs were overall similar in number, but retrotransposons were targeted by siRNAs more frequently than DNA transposons ( Figure S14; binomial test, p = 0 for Col-0, Bur-0 and C24). siRNA+ TEs were farther from genes ( Figure 4a; Figure S15a; MWU p,2.2610 216 for Col-0, Bur-0 and C24), and this bias was consistent among TE superfamilies ( Figure S16). To examine the effects of siRNA-targeting on the expression of flanking genes, we classified genes by whether the nearest TE was siRNA+ or siRNA2 (Table S5). In each accession, genes flanked by siRNA+ TEs had lower average expression levels than genes with adjacent siRNA2 TEs ( Figure S15b; MWU p,2610 216 for Col-0, Bur-0 and C24). Therefore, as found previously for Col-0, siRNA-targeting of TEs represses nearby genes and TEs that are close to genes are less likely to be targeted by siRNAs, either due to stronger selection for deletion of siRNA-targeted TEs close to genes or selection against siRNA-targeting of these TEs.
Because siRNAs that map to unique positions in the genome (usiRNAs) correlate more closely with DNA methylation than siRNAs that map to multiple positions (msiRNAs; [12]), we investigated whether usiRNAs and msiRNAs target TEs differentially, and how usiRNA2 and msiRNA-targeted TEs might affect the expression of nearby genes. All TEs with at least one usiRNA were labeled as usiRNA+ (Table S5). In both Bur-0 and C24, over 83% of siRNA+ TEs were usiRNA+, similar to what has been reported for Col-0 [14]. usiRNA+ TEs were farther away from genes than msiRNA+ TEs ( Therefore, even though TEs targeted by usiRNAs and msiRNAs are on average farther from genes, they more strongly reduce expression of proximal genes compared to TEs targeted by only msiRNAs. Overall, we confirmed that siRNA+ TEs, especially usiRNA+ TEs, suppress neighboring gene expression, consistent with a trade-off between reduced TE mobility and deleterious effects on neighboring gene expression [14,15]. Links between variation in TEs, siRNA-targeting, and gene expression differences If TEs suppress the expression of adjacent genes, presence of gene-proximal TEs in the different accessions should be associated with differences in expression levels of proximal genes. We found that expression of TE2 genes varied less between accessions than TE+ genes, and further that expression varied less between genes proximal to invariant TEs (InvTE+ genes) than genes proximal to variant TEs (VarTE+ genes; Figure 5a; MWU p[TE2/ TE+],2610 216 , p[InvTE+/VarTE+] = 2610 25 ). However, because TEs, and especially VarTEs, are found more often next to polymorphic genes, these conclusions could be confounded by correlated differences in genic polymorphisms. We therefore classified genes based on the extent of sequence variation (Table  S6). Regardless of degree of genic polymorphism, VarTE+ genes were the ones that varied most in expression between accessions (Figure 5b), indicating that TE variation increases variance in gene expression. We next determined whether differential siRNA-targeting influences gene expression. To remove the potentially confounding effects of variation in TEs themselves, we focused on InvTE+ genes and grouped these based on whether siRNAs for the adjacent TE could be detected in either all or none of the three accessions, or whether accessions differed in siRNA-targeting of the adjacent TE. We found that while variation in siRNAtargeting increased expression differences between accessions, this increase was not statistically significant ( Figure 5a). It should be noted that in our analysis we could not distinguish between the effects of differential siRNA-targeting and any perturbations of cisregulatory sequences.
Since each TE that differs in presence/absence or each siRNAtargeting variant between accessions represents a natural mutagenesis experiment, this offers an opportunity to study the effects on individual genes, to confirm the inferences drawn from averaging over all genes. We selected siRNA+ TE+ genes in Col-0 that are siRNA2 TE+ or TE2 in Bur-0 or C24 and tested for differential expression between Bur-0 or C24 and Col-0. To remove the potential confounding effect of genic polymorphism, we excluded genes with a polymorphism level greater than 2%. Overall 706 genes were retained for this analysis. The effect of siRNA-targeting on gene expression was further verified by comparing expression profiles among wild-type, rdr2-1 and a ddc (drm1drm2cmt3) DNA methyltransferase triple mutant [28]. Fifteen genes out of 706 showed significant up-regulation (top 5% ranking) in Bur-0 or C24 and in at least one of the RNA silencing mutants (Table S7). Although not statistically significant, this observation is consistent with siRNA-targeting and TE presence affecting gene expression. Moreover, it is likely an underestimate of TE effects on gene expression, given our stringent selection criteria.

siRNA-targeting and TE evolution
Because siRNA+ TEs suppress neighboring gene expression particularly efficiently, we asked whether targeting of different regions of TEs was reflected in the expression of adjacent genes. We first investigated whether invariant and variant TEs (InvTEs and VarTEs) differed in siRNA-targeting, normalized by TE length, and whether there were differences between invariable and variable regions of VarTEs (Figure 6a; Table S8). Fewer siRNAs mapped to siRNA+ VarTEs than to siRNA+ InvTEs (Figure 6a This finding raised the question whether TE regions that varied between accessions and were targeted by siRNAs had a particularly large effect on expression of adjacent genes. We therefore separated Col-0 genes within 2 kb of variable TEs into three subsets: genes next to siRNA2 VarTEs (siRNA2 VarTE+ genes); genes next to VarTEs with an siRNA-targeting bias towards invariable TE regions (InvsiRNA+ VarTE+ genes); and genes next to VarTEs with an siRNAs targeting bias towards variable TE regions (VarsiRNA+ VarTE+ genes; Table S8). As expected, siRNA2 VarTE+ genes had a higher average expression level compared to InvsiRNA+ VarTE+ genes (Figure 6b

04). This indicates that gene suppression by neighboring
TEs may not only be influenced by siRNA presence or absence at the TEs, but may also depend on which TE regions are targeted by siRNAs. We speculate that siRNA-targeting of particular TE regions suppresses the expression of nearby genes to such an extent that there is significantly higher selection pressure for these regions to be excised or mutated. Alternatively, due to the skew of usiRNA mapping towards variable regions, and the greater correlation between usiRNAs and TE methylation, the lower expression level of VarsiRNA+ VarTE+ genes may reflect a higher degree of epigenetic silencing of these elements compared to InvsiRNA+ VarTE+ genes.

Discussion
TEs constitute the majority of DNA in many plant genomes [2,3]. Evolutionary dynamics vary among TE types and they are affected, for example, by species demography and mating system [29]. A number of measures counteract the proliferation of TEs including TE silencing and removal. Because TE deletions via illegitimate recombination and unequal intra-strand homologous recombination are common [30][31][32][33], it is important to understand how changes in TE composition affect nearby gene expression. We have studied the interactions of TE variants, genic polymorphism, gene expression, and siRNA-targeting in Arabidopsis thaliana. We have shown that there is substantial variation in TEs between accessions primarily through large deletions, with invariant TEs on average closer to genes than variant TEs. We have confirmed that gene expression is positively correlated with distance to the nearest TE, and negatively correlated with the number of proximal TEs. While variation within a TE has some effect on the expression of adjacent genes, genes close to TEs are also on average more polymorphic than those that are not. Perhaps our most interesting observation is the increased usiRNA-targeting in TE regions that are variable between accessions compared to TE regions that are invariant.

TE variation between accessions
TEs may be prevented from reaching fixation within a population through negative selection, especially for gene-proximal, methylated TEs [13,15,34]. Therefore, it is perhaps unsurprising that TEs are over-represented in analyses of structural variants among accessions and between species [17,18,35,36], and that a recent comparison of 80 A. thaliana genomes reported evidence of structural variation in 80% of TEs [17]. Similarly, Hollister and Gaut [15] found that 44% of over 600 TE insertions were polymorphic among 48 accessions. Since most TEs in A. thaliana are relatively old [7], the simplest way to explain these patterns is ongoing deletion of TEs, which is also consistent with TEs in A. thaliana being on average farther from genes than in the closely related but outcrossing A. lyrata [7]. This may, however, be too simplistic an explanation as non-LTR retrotransposons are skewed towards an older insertion distribution than LTR retrotransposons [25], even though they are not significantly more variable (Table  S2). While TE presence/absence polymorphisms in different accessions have been previously characterized [17], we have shown that there is substantial sequence variation in about 6% of TEs when comparing accessions (Figure 1a). These TE variants are equally distributed throughout the genome (Figure 1b).

TE effects on nearby genes
TEs can affect the expression of proximal genes via mechanisms including disruption of promoter sequences, reduction of transcription through the spread of epigenetic silencing [13], or read-though antisense transcription [37]. Often TEs suppress the expression of proximal coding genes [15,22,38] however, TEs can also introduce new promoter sequences, leading to up-regulation of proximal genes [37]. In both plants and animals, TE-derived sequences have been recruited to form regulatory sequences and have contributed to coding regions [8,[39][40][41][42].
Methylated TEs suppress expression of proximal genes in A. thaliana, regardless of insertion upstream or downstream of the coding region. Purifying selection is therefore greatest for methylated TEs proximal to genes [15]. Notably, the effects of siRNAs on expression of proximal genes can only be detected up to 400 bp [14], while measurable TE effects extend to 2 kb [14]. This supports the assertion that TEs either directly affect gene expression by disruption of positive regulatory sequences, or otherwise act through DNA structure and epigenetic marks to affect genes over longer distances.
We found that TEs that with variable siRNA-targeting do not affect proximal genes more strongly than TEs that are targeted in all three accessions ( Figure 5). It is possible that siRNA-targeting varies independently of TE sequence variation, as observed recently for DNA methylation [43], and that such TEs mask more subtle differences between the TE classes examined. However, the region of the TE targeted by siRNAs does seem to matter, with siRNAtargeting of TE sequences within an accession that are variant/ absent in other accessions showing a greater suppression of proximal genes ( Figure 6). This agrees with the observation that genes close to usiRNA-targeted TEs have a lower expression average than those close to msiRNA-targeted TEs, and that usiRNAs are over-represented in the variable regions of transposons. A recent study of hybrids between parents of different ploidy found that a reduction in 24 nt siRNAs is associated with upregulation of more TE-associated genes than when there is no significant change in siRNA levels [44]. This result supports the hypothesis that siRNAs, or linked epigenetic changes, can affect the expression of nearby genes, with deletion of the siRNA-targeted regions alleviating repression of adjacent genes.
While TEs in the euchromatin are often found close to genes, methylated TEs are underrepresented upstream of genes, likely because changes in the promoter more easily affect gene expression than variation in the 39 region [13]. In agreement, methylated TEs have a skewed distribution, with older elements farther from genes, but unmethylated TEs do not show such a bias [15]. In a comparison of humans and chimpanzees, TE insertion site preference appears to be the main cause for TEs being found more often in the vicinity of genes with increased interspecific expression variation [45]. This is reminiscent of what we have observed, with additive effects of polymorphism, TE presence and TE variance on the variability of orthologous gene expression (Figure 2 and Figure 4). In a comparison of two rice subspecies, TE presence/ absence polymorphisms were also found to be underrepresented in SNP deserts [35]. There are several possible explanations for these observations: some genomic regions may suffer from generally elevated mutation rates TEs near highly conserved genes are more efficiently purged; or TE integration into more mutable genomic regions is favored. In the latter case, new mutations may destabilize DNA packing and facilitate TE insertions, similar to the TE insertion preference for transcribed genomic regions [42].

TE evolution through silencing and deletions
With our observation of TE deletions correlating with siRNAtargeting, we can expand the current model for TE evolution [15]. Our model starts with the duplication of a TE that is already present and targeted by siRNAs within the genome (Figure 7a and  7b), leading to all siRNAs produced by and targeting the original TE now being multiply-mapping siRNAs (msiRNAs). As the two copies of the duplicated TE gain mutations (enhanced by deamination of methylated cytosines), uniquely-mapping siRNAs (usiRNAs) are produced in addition to msiRNAs (Figure 7c). Hollister and colleagues [14] noted that usiRNA-targeting increases with TE age, while msiRNA-targeting decreases, and that TEs are expressed at lower levels when also targeted by usiRNAs. Furthermore, usiRNAs are more closely correlated with DNA methylation than are msiRNAs [12] and they are expressed at higher levels than msiRNAs [14]. With usiRNAs, the duplicated TEs will therefore be more effectively silenced, probably with a concurrent increase in methylation, a further reduction in the expression level of proximal genes, and thus increased selection against the TEs.
usiRNA-targeting may then facilitate TE inactivation through preferential deletion of usiRNA-targeted regions ( Figure 6 and Figure 7d). This may be actively promoted by the usiRNAs and attendant epigenetic marks, in a mechanism analogous to the siRNA-guided removal of ''internal eliminated sequences'' including TEs in Tetrahymena [46,47]. In favor of such a scenario, small deletions within TEs have been shown to occur more frequently than ectopic recombination events at the LTRs [31,48]. Ectopic recombination appears to be less important for TE elimination in A. thaliana, as TE density and recombination rate are not correlated in this species [48], and because ectopic recombination is lower in homozygotes [49]. No matter what the mechanism, deletions within TEs would reduce selection pressure by removing usiRNA target sites, inactivating TEs so they are no longer transposition-competent, and relieving proximal gene repression.
In apparent contrast to the majority of TEs, some are under positive selection [50,51], and TEs can also contribute to new regulatory networks [52]. Our model is only appropriate for TEs under neutral or negative selection. Modeling of TE dynamics suggests that transposition events occur in a cyclical manner [53,54], with some activation events creating new favorable genetic variants. One such example is provided by transposition of a TE that is induced upon heat stress in genetic backgrounds impaired in siRNA biogenesis confers heat-responsiveness to proximal genes [55].

Conclusions
We have exploited high-quality genome information from multiple accessions of a single species to study the effects of TE variation on proximal gene expression. We discovered a link between siRNA-targeting and TE variation that illuminates how epigenetic mechanisms may help to shape genomes, but several questions remain: Do usiRNAs directly facilitate TE deletions or do they act indirectly through differences in selection for deletions? Are TE deletions in other species also associated with regions of increased usiRNA-targeting? And do species differ in the rate of TE deletion via this mechanism? Because of the rarity of TE deletions, this is a challenging process to dissect. Genomes with a large fraction of TEs, such as those of many crop plants, might therefore prove more tractable systems for studying mechanism of TE removal than the TE poor A. thaliana genome.

Annotation of genes and TEs in Col-0, Bur-0, and C24
We extracted positions of genes and TEs from the A. thaliana Col-0 genome sequence TAIR version 9 from http://www. arabidopsis.org. We excluded genes and TEs within the centromeric regions [56]. To define gene and TE sets in Bur-0 and C24, we built genome templates using published Illumina paired-end reads of Bur-0 and C24 [19]. We used the SHORE pipeline [21] to align the reads to the Col-0 reference genome and extracted the consensus sequences as genome templates by calling bases with quality.24, support.6, concordance.0.7 and average hits = 1. We then applied a naïve projection of the coordinates of genes and TEs from Col-0 onto the genome templates to define the gene and TE sets of Bur-0 and C24. SHORE was also used to detect genomic variations by calling SNPs, small (1-3 bp) insertions/deletions and larger deletions from the genome templates of Bur-0 and C24 compared to the Col-0 genome using the same parameters for quality control. The distance between TEs and genes in Bur-0 and C24 was estimated from Col-0 using the annotated TE and gene coordinates, and adjusted to account for insertions and deletions between TEs and genes.

Comparison of polymorphism densities
For each polymorphism type (i.e., SNPs, small indels, and large deletions), we compared the densities pairwise across coding regions, intergenic regions and TEs. To test whether a higher density was significant in a particular genomic region (e.g. TE) compared to others (e.g. coding region), a cumulative binomial probability distribution was applied: p is the polymorphism density in coding regions, and k and n are the total number of polymorphic sites in TEs and the total length of TEs, respectively. We calculated gene polymorphism levels as the fraction of genic region containing small-scale variations in at least C24 or Bur-0, or one of the 80 A. thaliana accessions [17]. Genes with more than 20% zero sequencing coverage or no base calls among 80 accessions were excluded from the analysis. 4 kb 59 and 39 flanking regions for each TE were extracted. For each flanking region (FR), or genic regions (GR) within the FR, small-scale mutations and large deletion polymorphisms between Col-0 and Bur-0/C24 were calculated. Using all mutations, the polymorphism levels of TEs, FRs and GRs were ranked. A threshold of 50% was used to split FRs and GRs into high or low polymorphism datasets and thereby classify the TEs by genomic environment. The polymorphism levels of the FRs were calculated in 200 bp bins for each group of TEs, with binomial tests to compare polymorphism levels between TEs and FRs, and between different TE groups.

Gene expression
Inflorescences (meristem and flowers up to stage 14) were pooled from five plants of each accession grown at 23uC. Triplicate samples were collected between 7 and 8 hours into a 16 hour light cycle. RNA was extracted using the Qiagen (Hilden, Germany) Plant RNeasy Mini kit. Each biological replicate was analyzed with Affymetrix (Santa Clara, CA, USA) tiling 1.0R arrays and the data were processed according to published methods [57,58]. Tiling array probes that were polymorphic for C24 or Bur-0 were removed from the dataset for the affected accession(s). For gene expression estimates, $70% and at least 3 probes had to be present; all other genes were not considered.
Tilling array data from Arabidopsis Col-0 and the RNA silencing mutants rdr2-1 and ddc (drm1-1;drm2-2;cmt3-11) mutants were downloaded from GEO (GSE12549; [28]) and processed according to published methods [57,58]. Expression level changes for each dataset were estimated by fold-change differences between Bur-0/C24 and Col-0, and between the RNA silencing mutants and wild type Col-0. Background distributions of fold-change were calculated and genes, with a fold-change exceeding a one-sided 95% quantile in each dataset were considered as significantly upregulated in Bur-0/C24 or the mutants.

siRNA analyses
The siRNA datasets have been published [19] (GEO accession number GSE24569). We mapped the 24-nt siRNA reads onto both strands of the genome templates (see below) and the TEs of Col-0, Bur-0 and C24, respectively, using the Vmatch package (http://www.vmatch.de). Only reads with perfect matches were considered.

Comparison of usiRNA-and msiRNA-targeting
The statistical significance of over-representation of usiRNAs or msiRNAs within the variable regions of siRNA+ VarTEs in comparison to all siRNAs was tested using the cumulative binomial probability distribution given above. p, expected frequency, is the ratio between the number of siRNAs mapped to the variable regions the total number of siRNAs mapped to any region of siRNA+ VarTEs, and n and k are the total number of usiRNAs/msiRNAs mapped to any region and the number of usiRNAs/msiRNAs mapped to the variable regions, respectively.

Determination of InvsiRNA+ and VarsiRNA+ VarTEs
We defined an siRNA+ VarTE as either InvsiRNA+ or VarsiRNA+ if siRNAs are overrepresented in the invariable regions and variable regions, respectively. For siRNA+ VarTEs that contain siRNAs in both variable and invariable regions, we employed the cumulative binomial probability distribution described above to test whether siRNA-targeting shows statistically significant bias towards variable or invariable regions. For each siRNA+ VarTE, p in the formula above is the abundance of siRNA-targeting at the TE. To test the bias towards variable regions, n and k represent the genomic length of variable regions and the number of siRNAs targeting variable regions, respectively. Similarly, to test the bias towards invariable regions, n and k represent the genomic length of invariable regions and the number of siRNAs targeting invariable regions, respectively. Pvalues were adjusted for multiple hypothesis testing with the Benjamini-Hochberg method to control for a false discovery rate of 5% [59].

Data deposition
The siRNA and microarray data reported in this paper have been deposited in the National Center for Biotechnology Figure S13 siRNA-targeting of non-centromeric TEs. siRNAtargeting of non-centromeric genomic and TE regions in Col-0, Bur-0 and C24. The abundance of siRNA in TEs and genomewide is defined as the total number of mapped siRNA reads, normalized by total TE and genome length, respectively (see Table  S5). (TIF) Figure S14 TE superfamilies and siRNA-targeting. The fraction of TEs that are siRNA+ in each TE superfamily for each accession; Col-0 (a), C24 (b), or Bur-0 (c). Binomial test: p = 0 for Col-0, Bur-0 and C24. (TIF) Figure S15 Relationship of TE siRNA-targeting to gene proximity and the effect on gene expression in Col-0, Bur-0 and C24. (a) The average distance of siRNA2 (red) and siRNA+ (yellow) proximal TEs to the nearest genes. For siRNA+ proximal TEs, distances to the closest gene are compared between msiRNA+ TEs (cyan) and usiRNA+ TEs (navy). (b) Average expression level of genes when neighboring TEs are siRNA2 (red) or siRNA+ (yellow). For siRNA+ TEs, average neighboring gene expression levels are given for when the TEs are distal (greater than 2 kb from gene; dark gray) or proximal (within 2 kb; light gray). For genes with proximal siRNA+ TEs, expression levels are further compared for msiRNA+ TEs (cyan) vs usiRNA+ TEs (navy). The number of expressed genes used in each analysis is given. MWU: ** = p,0.01. (TIF) Figure S16 siRNA-targeting of TEs and TE proximity to genes by TE superfamily. Average distance to the nearest gene compared between siRNA+ and siRNA2 proximal TEs for each TE superfamily for the three accessions (a-c). (TIF) Table S1 TE variation by chromosomal position. The number of TEs, average TE variation and fraction of variant TEs between Col-0 and Bur-0/C24 are summarized depending on TE proximity to genes on chromosomes arms and pericentromeric regions. SE = standard error. (DOCX) Table S2 TE variation and proximity to genes. The number, average size, average distance to the nearest gene, degree of TE variation, insertion site preference and TE average age summarized by TE superfamily. (*) Rank is presented as descending TE distance to the nearest gene and degree of TE variation (MWU: pvalue,0.05). (**) Average age is given for each superfamily where possible. Mean average age for all A. thaliana TEs is 11.0 million years [25]. (DOCX) Table S3 TE and gene numbers for each accession. The number of total and non-centromeric TEs and genes is summarized. The number of genes sorted by TE proximity and TE variation is also given, along with the total number of expressed non-centromeric genes. (DOCX) Table S4 siRNA mapping statistics. Twenty-four nt siRNA reads that map to non-centromeric sequences in Col-0, Bur-0 and C24. (DOCX) Table S5 siRNA-targeting of TEs. TEs according to siRNAtargeting and siRNA mapping uniqueness. The number of genes is also given according to whether or not the closest TE is targeted by siRNAs. (DOCX)