
Transposon dynamics in the emerging oilseed crop Thlaspi arvense

  • Adrián Contreras-Garrido ,

    Contributed equally to this work with: Adrián Contreras-Garrido, Dario Galanti

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany

  • Dario Galanti ,

    Contributed equally to this work with: Adrián Contreras-Garrido, Dario Galanti

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Plant Evolutionary Ecology, University of Tübingen, Tübingen, Germany

  • Andrea Movilli,

    Roles Conceptualization, Data curation, Resources, Validation, Writing – review & editing

    Affiliation Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany

  • Claude Becker,

    Roles Funding acquisition, Investigation, Resources, Writing – original draft, Writing – review & editing

    Affiliation LMU Biocenter, Faculty of Biology, Ludwig Maximilians University Munich, Martinsried, Germany

  • Oliver Bossdorf,

    Roles Conceptualization, Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Plant Evolutionary Ecology, University of Tübingen, Tübingen, Germany

  • Hajk-Georg Drost ,

    Roles Conceptualization, Investigation, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    drost@tue.mpg.de (H-GD), weigel@tue.mpg.de (DW)

    Affiliation Computational Biology Group, Max Planck Institute for Biology Tübingen, Tübingen, Germany

  • Detlef Weigel

    Roles Conceptualization, Funding acquisition, Investigation, Project administration, Resources, Supervision, Validation, Writing – original draft, Writing – review & editing

    drost@tue.mpg.de (H-GD), weigel@tue.mpg.de (DW)

    Affiliation Department of Molecular Biology, Max Planck Institute for Biology Tübingen, Tübingen, Germany

Abstract

Genome evolution is partly driven by the mobility of transposable elements (TEs) which often leads to deleterious effects, but their activity can also facilitate genetic novelty and catalyze local adaptation. We explored how the intraspecific diversity of TE polymorphisms might contribute to the broad geographic success and adaptive capacity of the emerging oil crop Thlaspi arvense (field pennycress). We classified the TE inventory based on a high-quality genome assembly, estimated the age of retrotransposon TE families and comprehensively assessed their mobilization potential. A survey of 280 accessions from 12 regions across the Northern hemisphere allowed us to quantify over 90,000 TE insertion polymorphisms (TIPs). Their distribution mirrored the genetic differentiation as measured by single nucleotide polymorphisms (SNPs). The number and types of mobile TE families vary substantially across populations, but there are also shared patterns common to all accessions. Ty3/Athila elements are the main drivers of TE diversity in T. arvense populations, while a single Ty1/Alesia lineage might be particularly important for transcriptome divergence. The number of retrotransposon TIPs is associated with variation at genes related to epigenetic regulation, including an apparent knockout mutation in BROMODOMAIN AND ATPase DOMAIN-CONTAINING PROTEIN 1 (BRAT1), while DNA transposons are associated with variation at the HSP19 heat shock protein gene. We propose that the high rate of mobilization activity can be harnessed for targeted gene expression diversification, which may ultimately present a toolbox for the potential use of transposition in breeding and domestication of T. arvense.

Author summary

Transposable elements (TEs) are often considered genomic parasites, but they can also generate phenotypic novelty that helps organisms to adapt to new environments. To understand how TEs might contribute to phenotypic diversity and adaptive potential in the emerging oilseed crop Thlaspi arvense (field pennycress), we examined the dynamics of TE variation in a geographically diverse sample of this species. By surveying almost 300 wild accessions from North America and Eurasia we discovered over 90,000 polymorphic TE insertions. We identified not only genetic factors that vary between populations and that are associated with TE mobilization, but also TE families that are most likely to generate genetic diversity of interest to breeders.

Introduction

Transposable elements (TEs) are often neglected, mobile genetic elements that make up large fractions of most eukaryotic genomes [1]. In plants with large genomes, such as wheat, TEs can account for up to 85% of the entire genome [2,3]. Due to their mobility, TEs can significantly shape genome dynamics and thus both long- and short-term genome evolution across the eukaryotic tree of life. TEs are typically present in multiple copies per genome and they are broadly classified based on their replication mechanisms, as copy-and-paste (class I or retrotransposons) or cut-and-paste (class II or DNA transposons) elements. The two categories can be broken down into superfamilies based on the arrangement and function of their open reading frames [4]. Further distinctions can be made based on the phylogenetic relatedness of the TE encoded proteins [5,6]. To minimize the mutagenic effects of TE mobilization, host genomes tightly regulate TE load through an array of epigenetic repressive marks that suppress TE activity [7–9].

While epigenetic silencing of TEs is important for the maintenance of genome integrity and species-specific gene expression, TE mobilization can also generate substantial phenotypic variation through changing the expression of adjacent genes, either due to local epigenetic remodeling or direct effects on transcriptional regulation [10]. Because TE activity is often responsive to environmental stress [11–13] and other environmental factors [14–17], it has been proposed that it could be used for speed-breeding through externally controlled transposition activation [18].

Thlaspi arvense, field pennycress, yields large quantities of oil-rich seeds and is emerging as a new high-energy crop for biofuel production [19–21]. As plant-derived biofuels can be a renewable source of energy [22], the past decade has seen efforts to domesticate this species and understand its underlying genetics in the context of seed development and oil production. Thlaspi arvense is particularly attractive as a crop because it can be grown as winter cover during the fallow period, protecting the soil from erosion [19]. Natural accessions of T. arvense are either summer or winter annuals, with winter annuals being particularly useful as a potential cover crop [23]. Native to Eurasia, T. arvense was introduced and naturalized mainly in North America [24].

As a member of the Brassicaceae family, T. arvense is closely related to the oilseed crops Brassica rapa and Brassica napus, as well as the undomesticated model plant Arabidopsis thaliana [25]. A large proportion of the T. arvense genome consists of TEs [26], and TE co-option has been proposed as a mechanism particularly for short-term adaptation and as a source of genetic novelty [27]. As in many other species, differences in TE content are likely to be a major factor in epigenetic variation as well, especially through remodeling of DNA methylation [28].

Here, we use whole-genome resequencing data from 280 geographically diverse T. arvense accessions to characterize the inventory of mobile TEs (the ‘mobilome’), TE insertion patterns of class I and class II elements and their association with variation in the DNA methylation landscape. We highlight a small TE family with preference for insertion near genes, which may be particularly useful for identifying new genetic alleles for T. arvense domestication.

Results

Phylogenetically distinct transposon lineages shape the genome of T. arvense

To understand TE dynamics in Thlaspi arvense, we first reanalyzed its latest reference genome, MN106-Ref [26]. In total, 423,251 transposable elements were categorized into 1,984 unique families and grouped into 14 superfamilies (S1 Table), together constituting 64% of the ~526 Mb MN106-Ref genome. Over half of the genome consists of LTR (Long Terminal Repeat)-TEs. Using the TE model of each LTR family previously generated by structural de novo prediction of TEs [26], we assigned 858 (~70%) of the 1,205 Ty1 and Ty3 LTR-TEs to known lineages based on the similarity of their reverse transcriptase domains [5] (Fig 1A).

Fig 1. Genome-wide distribution and classification of TE families and superfamilies in the T. arvense reference genome MN106-Ref.

(A) Phylogenetic tree of LTR retrotransposons based on the reverse transcriptase domain. (B) Genome-wide distribution of TE family and superfamily abundances. The tracks denote, from the outside to the inside, (1) protein-coding loci, (2) Athila, (3) Retand, (4) CRM, (5) Tekay, (6) Reina, (7) Ale, (8) Alesia, (9) Bianca, (10) Ivana, (11) all DNA TEs. (C) Evolutionary age estimates of intact copies of autonomous versus non-autonomous TE families. The P-value was computed with a Wilcoxon rank-sum test. (D) Total number of intact TEs in different lineages. (E) Distribution of insertion time estimates for intact LTR elements across different LTR TE lineages (shown if the number of intact TEs was greater than 10).

https://doi.org/10.1371/journal.pgen.1011141.g001

The most abundant LTR-TE lineage in T. arvense is Ty3 Athila (S2 Table) with ~180,000 copies, roughly 3-fold more than the next most common lineage, Ty3 Tekay (~57,000), and 6-fold more than Ty3 CRM (~30,000). The most abundant Ty1 elements belonged to the Ale lineage, with 108 families, while the Alesia and Angela lineages were represented by only one family each (S2 Table).

Next, we compared the genomic distribution of lineages within the same TE superfamily (Fig 1B). In the Ty3 superfamily, CRM showed a strong centromeric preference, whereas Athila was more common in the wider pericentromeric region. In the Ty1 superfamily, Ale elements were enriched in centromeric regions, whereas Alesia showed a preference for gene-rich regions.

Thlaspi arvense LTR retrotransposons present signatures of recent activity

To assess the potential for, and natural variation in, TE transposition across accessions, we used the complete set of protein domains identified for each TE model to classify each family as either potentially autonomous or non-autonomous (Methods). About 60% of all TE families (1,260 out of 2,038) encoded at least one TE-related protein domain, but only about a quarter had all protein domains necessary for transposition, and we classified these 537 families as autonomous. Autonomous TE families had on average more and longer copies than non-autonomous ones, although both contributed similarly to the total TE load in the genome (S1 Fig). Next, we focused on individual, intact LTR-TE copies, since they are often the source of ongoing mobilization activity [13,18,56]. Overall, the 193 autonomous LTR-TE families had more members without apparent deletions than the 1,027 non-autonomous LTR-TE families (2,039 versus 339). Intact LTR-TEs from autonomous families tended to be evolutionarily younger and more abundant than their non-autonomous counterparts (Fig 1C). Among lineages, Athila had the most intact members, followed by Tekay and CRM (Fig 1D), although estimates of insertion times revealed that the Ale and Alesia Ty1 lineages account for the most recent transposition bursts (Fig 1E).

TE polymorphisms in a collection of wild T. arvense populations

Our analysis of the MN106-Ref reference indicated that a substantial part of the genome consists of autonomous TE families. To learn how TE mobility has shaped genomes at the species level, we surveyed differences in TE content in a large collection of natural accessions. We compiled whole-genome sequences of 280 accessions from different repositories (S3 Table), covering twelve geographic regions, and much of the worldwide distribution of T. arvense in its native range and in regions where it has become naturalized (Fig 2A).

Fig 2. The genome-wide landscape of TE insertion polymorphisms in T. arvense.

(A) Distribution of accessions across their native Eurasian and naturalized North American range in the Northern hemisphere (omitting a sample from Chile, included in the Americas group). (B) A SNP-based principal component analysis (PCA) of all accessions, with color code as in (A). Because the accessions in the Armenian cluster are separated from the other geographic populations, we recalculated a PCA without the Armenian samples, as shown in S4 Fig. (C) Allele-frequency spectrum of TIPs (blue) and SNPs (red). (D) Cumulative sums of unique insertions per region as a function of sampled accessions. (E) Average TIP frequencies over 100 kb windows along the genome, compared to gene and TE densities, also displayed in 100 kb windows. Map source: naturalearthdata.com.

https://doi.org/10.1371/journal.pgen.1011141.g002

We first characterized the population structure of this collection with a subset of high-confidence SNPs and short indels that we used to cluster the accessions by principal component analysis (PCA) (Fig 2B) (Methods). We also constructed a maximum likelihood tree without considering migration flow for these populations, using the two sister species Eutrema salsugineum and Schrenkiella parvula as an outgroup (S2 Fig). North American accessions clustered together with European accessions, supporting the view that T. arvense was introduced to North America from Europe. Chinese accessions formed a separate cluster, but the most isolated cluster was composed of Armenian accessions, as reported previously [20,26].

Next, we screened our data for TE insertion polymorphisms (TIPs), i.e., TEs not present in the reference genome assembly. This will in most cases be due to insertions that occurred on the phylogenetic branch leading to the non-reference accession, although it formally could also be the result of deletion or excision events of a shared TE on the branch leading to the reference accession.

We detected 18,961 unique insertions, which were unequally distributed among populations, with an excess of singletons (5,617 singletons) (Fig 2C). The allele frequency of TIPs was on average lower than that of SNPs (Fig 2C), with the caveat that detection of TIPs may incur more false negatives. Saturation analysis (Fig 2D) indicated that we were far from sampling the total TE diversity in T. arvense, especially in Armenian and Chinese accessions. Taken at face value, the disparity in singleton frequencies between TIPs and SNPs would suggest either that TIPs are on average evolutionarily younger than SNPs, or that there is stronger selection pressure against TE insertions [29] (Fig 2C). What speaks against the latter view is that TIPs in the gene-rich fraction of the genome, near the telomeres, have higher allele frequencies (Fig 2E), while TIPs in the pericentromeric regions are more abundant, but have lower allele frequencies (see S3 Fig for a statistical assessment).
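The singleton count and the saturation analysis described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (a dictionary mapping each accession to the set of its non-reference insertion IDs), the function names, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def allele_counts(calls):
    """Count, for each insertion ID, how many accessions carry it.

    `calls` maps accession name -> set of insertion IDs (hypothetical layout).
    """
    counts = Counter()
    for insertions in calls.values():
        counts.update(insertions)
    return counts

def singleton_fraction(calls):
    """Fraction of unique insertions that occur in exactly one accession."""
    counts = allele_counts(calls)
    if not counts:
        return 0.0
    singletons = sum(1 for n in counts.values() if n == 1)
    return singletons / len(counts)

def saturation_curve(calls, order):
    """Cumulative number of unique insertions as accessions are added in `order`."""
    seen = set()
    curve = []
    for accession in order:
        seen |= calls[accession]
        curve.append(len(seen))
    return curve

# Toy data, invented for illustration only.
toy = {
    "acc1": {"chr1:100", "chr1:500"},
    "acc2": {"chr1:100", "chr2:40"},
    "acc3": {"chr1:100", "chr2:40", "chr3:7"},
}
```

A curve that keeps rising with each added accession, as in this toy example, corresponds to the unsaturated sampling reported for the Armenian and Chinese accessions.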

We complemented our analysis of TIPs with a corresponding analysis of TE absence polymorphisms (TAPs), which we define as TEs that are found in the reference assembly but missing from other accessions. This could be due to insertions having occurred on the phylogenetic branch leading to the reference accession or excisions of DNA TEs by a cut-and-paste mechanism. TAPs were detected using a custom TAP annotation pipeline (Methods).

Overall, a comparison of TIP and TAP distributions by PCA showed Armenian accessions to be clear outliers, with all other accessions clustering closely together (Figs 2B and S5), indicating that most of the observed TE variation reflects the population structure observed with SNPs. As with SNPs, Armenian accessions harbor the largest number of both TIPs and TAPs. Looking at the impact of these polymorphisms on the genomic landscape (Fig 3A), we find a major hotspot of TAPs on chromosome 4 for a subset of accessions from Southern Sweden. There also appears to have been major insertion activity in the clade leading to the reference accession, as indicated by the high density of reference insertions missing in all other populations at the ends of chromosomes 4 and 5. For both TIPs and TAPs, the major source of TE polymorphisms is the activity of Ty3 LTRs (RLGs), especially Ty3 Athila (Fig 3B). Many other TE families contributed to both TIPs and TAPs as well, with 1,203 families having at least one TIP, and 1,268 having at least one TAP. The more distant a population is geographically from the reference, the greater the contribution of non-autonomous families to the TIP load, with the exception of Northern Germany (Fig 3C).

Fig 3. The T. arvense mobilome.

(A) Genomic distribution of TIPs and TAPs in chromosomes 4 and 5, where we observe major TIP/TAP hotspots. TIPs and TAPs along the other chromosomes are shown in S6 Fig. (B) Contribution of different superfamilies to transposon insertion polymorphisms (TIPs) and transposon absence polymorphisms (TAPs). (C) Frequencies of autonomous and non-autonomous TE-derived TIPs in different geographic regions. (D) Average count of TIPs per individual for the five TE families with the highest contribution to either TIPs or TAPs in each geographic region. For all figure panels, the gray box illustrates the color scheme for the geographical populations and for autonomous/non-autonomous families.

https://doi.org/10.1371/journal.pgen.1011141.g003

Across all populations, most TE activity was due to a small set of 25 TE families, with the Athila lineage standing out in particular (Fig 3D). For highly active TE families, TIPs were more diverse than TAPs, as the latter were predominantly driven by LTR retrotransposons.

Host control of TE mobility

In A. thaliana, natural genetic variation affects TE mobility and genome-wide patterns of TE distribution, driven by functional changes in key epigenetic regulators [14,30–32]. The rich inventory of TE polymorphisms in T. arvense offered an opportunity to investigate the genetic basis of TE mobility in a species with a more complex TE landscape. We tested for genome-wide association (GWA) between genetic variants (SNPs and short indels) and TIP load of different TE classes, TE orders and TE superfamilies [4]. We found several GWA hits next to genes that are known to affect TE activity or are good candidates for being involved in TE regulation (Fig 4A–4D). The results differed strongly between class I and class II TEs: while class I TEs were associated with a wide range of genes encoding mostly components of the DNA methylation machinery (Fig 4A–4D), class II TEs were mostly associated with allelic variation at an ortholog of O. sativa HEAT SHOCK PROTEIN 19 (HSP19). Only class I TE superfamilies were enriched for significant associations close to DNA methylation machinery genes (Fig 4B), and this difference was consistent for most superfamilies that belonged to either class I or class II (S7 Fig). The most prominent hits for class I TIPs were near orthologs of A. thaliana BROMODOMAIN AND ATPase DOMAIN-CONTAINING PROTEIN 1 (BRAT1), which prevents transcriptional silencing and promotes DNA demethylation [7], and components of the RNA-directed DNA methylation machinery such as DOMAINS REARRANGED METHYLTRANSFERASE 1 (DRM1), ARGONAUTE PROTEIN 9 (AGO9) and DICER LIKE PROTEIN 4 (DCL4) [33] (Figs 4A–4D, S7 and S8). Another category of genes that emerged in our GWA comprises genes encoding DNA and RNA helicases such as RECQL1 and 2 (Figs 4 and S8). Some of our GWA peaks extend over several genes and might reflect associations with less well characterized genes, but others have the strongest associations in individual genes such as HSP19 and BRAT1 (S8 Fig). For HSP19, the top SNPs are located in introns and it is difficult to predict their effect. BRAT1 has two highly significant, fully linked SNPs in exons 1 and 4. The SNP in exon 4 (Chr1:63627484) introduces a stop codon that removes part of the ATPase domain and the entire chromatin-binding bromodomain, and this mutation is therefore likely to eliminate BRAT1’s anti-silencing activity [7].

Fig 4. GWA analysis for TIP load of a class I and a class II TE superfamily.

Results including all superfamilies are shown in S7 Fig. (A) Manhattan plots with candidate genes indicated next to neighboring variants. The red line corresponds to a genome-wide significance with full Bonferroni correction, the blue line to a more generous threshold of –log(p) = 5. (B) Enrichment and expected FDR of a priori candidate DNA methylation machinery genes, for stepwise significance thresholds [28,34]. (C) Shown are the allelic effects of the red-circled variants from the corresponding Manhattan plots on the left. (D) Shown are the candidate genes marked in A, their putative functions and distances to the top variant of the neighboring peaks. Blue font denotes DNA methylation machinery genes included in the enrichment analyses. (E) DNA methylation around class I and class II TIPs in carrier vs. non-carrier individuals.

https://doi.org/10.1371/journal.pgen.1011141.g004
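The two significance lines described for panel A of Fig 4 can be sketched in Python as follows. This is only an illustration: the number of tested variants is not stated in this section, so `n_tests` in the usage note is a made-up value, and the family-wise error rate of 0.05 and the use of a base-10 logarithm are assumptions as well.

```python
import math

def bonferroni_threshold(n_tests, alpha=0.05):
    """Return the -log10(p) cutoff for a Bonferroni-corrected error rate `alpha`.

    `alpha` = 0.05 is an assumed default, not a value taken from the paper.
    """
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return -math.log10(alpha / n_tests)

def classify_hit(p_value, n_tests, suggestive=5.0, alpha=0.05):
    """Label a variant by which of the two thresholds its p-value passes."""
    score = -math.log10(p_value)
    if score >= bonferroni_threshold(n_tests, alpha):
        return "genome-wide"
    if score >= suggestive:
        return "suggestive"
    return "not significant"
```

With a hypothetical one million tested variants, `bonferroni_threshold(1_000_000)` gives a cutoff of about 7.3, so a variant with p = 1e-6 would pass only the more generous line at 5.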

Since accessions that diverged earlier from the reference had potentially more time to accumulate TIPs, we also estimated the age of all insertions [14] and repeated the GWA using only TIPs younger than 500,000 years. The results were similar to using all TIPs, suggesting that this potential reference bias is unlikely to drive any of the identified associations (S9 Fig).

To further confirm the association between the DNA methylation pathway and class I TE polymorphisms, we used published bisulfite sequencing data to quantify methylation levels in the regions neighboring TIPs [28]. In all three sequence contexts (CG, CHG, CHH; where H stands for any nucleotide except G), we found a significant increase of methylation up to 1 kb around class I, but not around class II TE insertions (Fig 4E). Taken together, we interpret these results as indicating that class I TE mobility is primarily controlled by the DNA methylation machinery, leading to RdDM spreading around novel insertions, thus creating substantial epigenetic variation beyond TE loci.
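The three methylation contexts named above can be made concrete with a short Python sketch that classifies a cytosine by the two bases that follow it. The function name is hypothetical, only the forward strand is handled, and this is not the tool used to process the bisulfite data.

```python
# H stands for A, C or T (any nucleotide except G).
H_BASES = set("ACT")

def cytosine_context(seq, i):
    """Classify the cytosine at index `i` of a 5'->3' sequence.

    Returns "CG", "CHG" or "CHH", or None when the context cannot be
    determined (sequence end or ambiguous bases such as N).
    """
    seq = seq.upper()
    if seq[i] != "C":
        raise ValueError("position i is not a cytosine")
    first = seq[i + 1:i + 2]   # empty string at the sequence end
    second = seq[i + 2:i + 3]
    if first == "G":
        return "CG"
    if first in H_BASES and second == "G":
        return "CHG"
    if first in H_BASES and second in H_BASES:
        return "CHH"
    return None
```

For example, the first cytosine in `CAG` is in a CHG context, while the first cytosine in `CAT` is in a CHH context.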

An autonomous Alesia LTR family with insertion preference for specific genomic regions

Our characterization of the T. arvense mobilome revealed a strikingly uneven distribution of one autonomous LTR Ty1 family belonging to the Alesia lineage, Alesia.FAM.7. This family encompasses 144 elements in the reference genome, 51 of which are complete copies. Although this is a relatively small TE family, 44 copies are close to genes (< 1 kb), and 8 of those are within genes (S3 Table). Across all 4,215 Alesia.FAM.7 TIPs, that is, insertions not present in the reference genome, we found a strong enrichment near and within genes, which was the case for ~75% of all insertions (Fig 5A and 5B). The genes potentially affected by these insertions were involved in a wide range of functions, including metabolism and responses to biotic and abiotic factors (Fig 5C). Reference insertions were rarely missing in other accessions, except for an intronic reference insertion that was detected as absent in some Swedish accessions. The prevalence of Alesia.FAM.7 TIPs near genes suggests that the skewed distribution in the reference is not so much due to removal of insertions in other regions, but that it reflects an unusual insertion site preference of this family across all examined accessions.
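The distinction between insertions within genes, insertions close to genes (< 1 kb), and all others can be sketched as follows. This is a simplified illustration with hypothetical names and coordinates: it treats one chromosome at a time, uses 1-based inclusive gene intervals, and does not separate exons from introns as Fig 5A does.

```python
def classify_insertion(pos, genes, flank=1000):
    """Classify an insertion position relative to gene intervals.

    `genes` is a list of (start, end) tuples on the same chromosome
    (1-based, inclusive; hypothetical layout). `flank` is the distance
    in bp below which an insertion counts as close to a gene.
    """
    nearest = None
    for start, end in genes:
        if start <= pos <= end:
            return "genic"
        distance = start - pos if pos < start else pos - end
        if nearest is None or distance < nearest:
            nearest = distance
    if nearest is not None and nearest < flank:
        return "near gene"
    return "intergenic"

def gene_associated_fraction(positions, genes, flank=1000):
    """Fraction of insertions that fall within or close to a gene."""
    if not positions:
        return 0.0
    hits = sum(
        1 for pos in positions
        if classify_insertion(pos, genes, flank) != "intergenic"
    )
    return hits / len(positions)

# Invented coordinates for illustration only.
toy_genes = [(1000, 2000), (10000, 12000)]
```

Applied to a real annotation, `gene_associated_fraction` would correspond to the kind of summary reported above for Alesia.FAM.7, where about three quarters of insertions were near or within genes.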

Fig 5. Summary statistics and characterization of the Alesia.FAM.7 family in T. arvense and other Brassicaceae.

(A) Distribution of several TE families across different genomic contexts in T. arvense accessions. While several other families, such as MuDR.FAM.140 or CRM.FAM.215, are also often found in introns, Alesia.FAM.7 is the only family that is commonly inserted in coding sequences. (B) Distribution of several LTR lineages along chromosome 1 in MN106-Ref. (C) GO enrichment of genes associated with Alesia.FAM.7 TIPs. (D) Phylogenetic tree of Alesia.FAM.7 related copies across different Brassicaceae. (E) Structure of the Alesia.FAM.7 model: 5’ Long terminal repeat (LTR); primer binding site (PBS), a tRNA binding site, in this case complementary to A. thaliana methionine tRNA; Gag domain; Pol domains: Protease (Prot), Integrase (Int) and the two subdomains of the reverse transcriptase, the DNA polymerase subdomain (Rvt2) and the RNase H subdomain (RNAseH1); polypurine tract (PPT). The location of a putative heat responsive element (HRE) with the four-nGAAn motif in the LTR is indicated by a purple segment.

https://doi.org/10.1371/journal.pgen.1011141.g005

Alesia.FAM.7 is highly similar to the Terestra TE family, first described in A. lyrata [35]. The Terestra family, which has been reported in six Brassicaceae, is heat responsive due to a transcription factor binding motif also found in A. thaliana ONSEN, where it can be bound by heat shock factor A2 (HSFA2) via a cluster of four nGAAn motifs called heat responsive elements (HRE) [12]. In Alesia.FAM.7, we found a similar four-nGAAn motif cluster in most copies in the 5’ LTR portion of the elements (Fig 5E). A search against the NCBI NT database [36] revealed the presence of this TE family, with an Alesia-diagnostic reverse transcriptase sequence signature, in several additional Brassicaceae (Fig 5D), notably B. rapa, B. napus, B. oleracea, Raphanus sativus, and other Arabidopsis species, but not in A. thaliana. It is conceivable that this heat-responsive, euchromatophilic Alesia family rewires gene regulatory networks between and within Brassicaceae species. We conducted a similar search of a subset of TE families against the NCBI NT database (S10 Fig) and Alesia.FAM.7 was indeed the only deeply conserved TE family with evidence for recent activity.

Discussion

Although A. thaliana and T. arvense are close relatives, with evolutionary divergence estimates of 15–24 million years ago [27] and similar life histories in terms of demographic dynamics, geographic expansion, and niche adaptation [25,37], their genomes are very different, one key difference being the significantly higher TE load of the T. arvense genome. Exploring the diversity and dynamics of mobile elements in such TE-rich genomes enables a better understanding of the evolution of genome architecture. Here, we report how TEs drive genome variation in T. arvense by analyzing the diversity and phylogenetic relationships of TEs, as well as their autonomous status, ongoing activity, and contrasts between biogeographic populations.

Many recent studies have confirmed that several TE families do not insert randomly in the genome, and that their apparent enrichment in specific portions of the genome, such as centromeres, is not simply due to purifying selection [38]. Many TEs have a clear insertion site preference [39], driven both by primary DNA sequence and by epigenetic marks; for example, Ty1 insertions in A. thaliana are biased towards regions enriched in H2A.Z [40]. Our results support the view that the phylogenetic identity of an LTR element plays a role in the observable genome-wide insertion pattern in T. arvense. Within the Ty1 elements, Ale elements are preferentially centrophilic whereas Alesia elements are enriched in the genic regions of the genome. For the Ty3 elements, the Retand clade does not show any particular preference across the chromosome, while CRM elements are centrophilic and Athila insertions are often found in pericentromeric regions. Thus, a phylogenetic classification of TEs, alongside the classification into autonomous and non-autonomous elements, is key to understanding TE dynamics, especially in LTR retrotransposon-rich genomes.

We learned that one third of the T. arvense genome consists of Ty3/Athila LTR-TEs, which is considerably more than in other Brassicaceae, such as A. thaliana and Capsella rubella, where Ty1/Ale elements are the most abundant TE lineage [41]. This suggests that a single or multiple ancient Athila bursts may underlie genome size expansion in T. arvense. This is in line with the expansion of the Ty3 LTR-TE superfamily, to which Athila belongs, in Eutrema salsugineum [42], from which T. arvense diverged 10–15 million years ago [43]. Similar Ty3 associated expansions have been reported, for example, for Capsicum annuum (hot pepper) [44].

Having established substantial variation in TE content among natural accessions, we asked whether there is also genetic variation for control of TE mobility, as is the case for A. thaliana [14,30,31]. Perhaps not too surprisingly, the sets of genes associated with TE mobilization appear to depend on the nature of the TE transposition mechanism. While variation in retrotransposon insertions was strongly associated with several genes involved in the DNA methylation machinery, DNA transposon insertions were instead associated with a single Heat Shock Protein 19 (HSP19) gene, and this was consistent across different class I (retrotransposon) and class II (DNA transposon) superfamilies. Although studies in A. thaliana have highlighted differences in the genetic control of methylation and mobility of the two classes of transposons, GWA for CHH methylation of TE families did not produce very different signals for class I and II families [45]. The same was true for TIP counts of different families and superfamilies as phenotypes [14,32]. Since HSP19 is an ortholog of an O. sativa gene that is absent from the A. thaliana reference genome, it is possible that this gene is providing new functionality in T. arvense. What this functionality might be is difficult to answer with our data, but different types of HSPs are involved in DNA methylation-dependent silencing of genes and TEs in A. thaliana [46], and in controlling transposition in several other organisms [47–49]. That class I and class II TEs in T. arvense apparently differ in their genetic requirements for silencing can potentially be linked to our observation that DNA methylation spreads more extensively from class I than from class II TE insertions in this species.

The contrast between Alesia and Athila lineages suggests that TEs may be more than detrimental genome parasites. There are many examples from animals and plants of both TE proteins and TEs themselves having been domesticated and thereby enriching genome function [38,50–52]. While parasitic TEs may constitute the majority of TEs within a given species, there can be different life cycle strategies adopted by TEs [53]. With respect to notable TE families in T. arvense, Alesia’s gain of HREs might provide a unique selection advantage, allowing it to survive more easily in the genome, as long as copy numbers are low, in a relationship with the host that resembles other forms of symbiotic lifestyle. Further research of this enigmatic Alesia lineage, which is found in many angiosperms [41], could enhance our understanding of the different strategies used by TEs to persist over long evolutionary time scales.

Turning to more practical matters, it might be possible to exploit the preference of Alesia.FAM.7, which is conserved in several Brassicaceae species, for genic insertions as a source of fast genic novelty for crop improvement. TE insertions in exons might disrupt genes, while intronic insertions might modulate alternative splicing or reduce the accumulation of correctly spliced transcripts. An example is provided by A. thaliana accessions in which an intronic COPIA insertion in the disease resistance gene RPP7 shifts the balance between full-length and truncated transcripts [54]. It would therefore be useful to determine how easily Alesia.FAM.7 can be mobilized by heat in T. arvense, and conversely, whether heat responsiveness might also be a source of unwanted genetic variation in breeding programs.

Methods

Dataset summary

For the investigation of T. arvense natural genetic variation (TIPs, TAPs, and short variants), we leveraged Illumina short read data from three studies [26,28,43]. The largest survey investigated both genetic and DNA methylation variation in 207 European accessions (13 from the Netherlands, 16 from the South of France, 42 from the South of Germany, 52 from the North of Germany, 48 from the South of Sweden and 40 from Middle Sweden). In addition, we used data from 39 Chinese accessions (10 each from Xi’an, Zuogong, and Hefei and 9 from MangKang) [43], 21 from the US, and one each from Chile and Canada [26]. For most of the European accessions, Illumina whole-genome bisulfite-sequencing (BS-seq) data were available as well [28] (S3 Table). We used as reference the assembly generated in [26], together with the gene and TE annotation also generated in that study. To visualize the accession locations on a world map, we used free vector and raster map data from naturalearthdata.com. We supplemented this dataset by sequencing 12 additional accessions, 7 Armenian and 5 European, using Illumina paired-end 2x150 bp WGS (S3 Table). Briefly, we grew plants in soil, collected fully developed rosette leaves, snap-froze them in liquid nitrogen and disrupted the tissue to frozen powder. We extracted genomic DNA and prepared Illumina libraries as described before [28]. To validate our TIP analysis, we also sequenced a single Armenian accession (Ames32867/TA_AM_01_01_F3_CC0_M1_1) using PacBio HiFi long-read technology. For the ancestry analysis, we used assemblies of Eutrema salsugineum and Schrenkiella parvula (NCBI ID: PRJNA73205; Phytozome genome ID: 574, respectively) as outgroup species.

TE analysis of the reference genome

To resolve phylogenetic relationships of the LTR-TEs in T. arvense using information from a collection of green plants (Viridiplantae) at REXdb [5], and to classify T. arvense LTR-TEs into lineages, we used the DANTE pipeline (https://github.com/kavonrtep/dante) and its Viridiplantae v3.0 database. We used a published T. arvense TE library [26] as query with default parameters except for “--interruptions”, which we set to 10 because the input consisted of consensus TE models, which likely contain frameshifts and stop codons. Using the identified protein domains, we evaluated whether a given TE family is autonomous, i.e., whether it codes for the entire machinery needed for transposition. An LTR retrotransposon family was considered autonomous if DANTE identified the following domains: retrotranscriptase, RT; capsid related domain, GAG; RNase H, RH; protease, PROT; and integrase, INT. Autonomous non-LTR retrotransposons, LINEs, had to contain: retrotranscriptase, RT. DNA TE families had to contain: transposase, TPase. DNA TEs of the Helitron superfamily had to contain in addition: DNA helicase, HEL.
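The autonomy criteria above amount to a set-inclusion test on the detected domains. The following is a minimal Python sketch, assuming hypothetical order labels ("LTR", "LINE", "DNA", "Helitron") and that the domain hits per family have already been collected into sets of the abbreviations used in the text; it does not parse DANTE output and is not the authors' script.

```python
# Required protein domains per TE order, taken from the criteria in the text.
# The order labels used as dictionary keys are illustrative assumptions.
REQUIRED_DOMAINS = {
    "LTR": {"RT", "GAG", "RH", "PROT", "INT"},
    "LINE": {"RT"},
    "DNA": {"TPase"},
    "Helitron": {"TPase", "HEL"},  # transposase plus, in addition, helicase
}


def is_autonomous(order, domains):
    """Return True if every domain required for this order was detected."""
    required = REQUIRED_DOMAINS.get(order)
    if required is None:
        raise ValueError(f"unknown TE order: {order}")
    return required.issubset(domains)


# Hypothetical families: a complete LTR element and a Helitron lacking HEL.
print(is_autonomous("LTR", {"RT", "GAG", "RH", "PROT", "INT"}))  # True
print(is_autonomous("Helitron", {"TPase"}))                      # False
```

Extra domains beyond the required set do not affect the call, which matches the wording "had to contain".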

After classification, we used the inferred amino acid sequences of the retrotranscriptase domains extracted from Ty3 and Ty1 elements identified by DANTE to produce two multiple sequence alignments using MAFFT with standard parameters [55]. Using RAxML [56], we built a set of phylogenetic trees under a JTT + gamma model, with 100 rapid bootstraps to assess the branch reliability of the NJ tree.

Analysis of intact LTR-TEs and estimates of LTR-TE age used LTRpred [57], run against the reference genome with default parameters. We compared the genomic positions of the de novo predicted LTR-TEs with those in the annotation using bedtools intersect [58] with the parameters “-f 0.8 -r”.

To analyze the extent of conservation of TE families larger than 2 kb across Brassicaceae, we ran BLASTN [59] against the NCBI NT database [36], June 2022 release. Next, we filtered the results by requiring 80% identity and 80% alignment coverage of the query sequence. For the Alesia.FAM.7 TE family, we performed a multiple sequence alignment of the filtered matches using MAFFT [55] with default settings and constructed a tree with RAxML [56] with the parameters “--model JTT+G --bs-trees 100”. To discover nGAAn motifs de novo in all the sequences of Alesia.FAM.7, we ran MEME [60] with the following parameters: “-mod zoops -nmotifs 3 -minw 6 -maxw 50 -objfun classic -revcomp -markov_order 0”. The selected de novo HRE motif had 4 nGAAn clusters in the reverse strand: AAAGAAAGAGTGTTCTTCATAAGTTCTCTTATTCTC (E-value = 2.8e-33).

Expression analysis of reference TEs was performed using TEspeX [61]. We obtained paired-end RNA-seq data from 27 samples comprising nine different tissues from the MN106-Ref reference accession [26]. We obtained raw counts for each library by mapping the reads to both transcripts of protein coding genes and to the TE consensus library. Raw counts were normalized as suggested [61] (RPM: raw counts/total mapped reads x 1 million). We used a non-parametric Wilcoxon rank-sum test to compare expression between autonomous and non-autonomous TE families.
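The RPM normalization stated above can be written in a few lines. This is a minimal Python sketch, assuming raw counts are held in a dictionary keyed by hypothetical TE family names; it illustrates the formula only and does not reproduce TEspeX.

```python
def rpm(raw_counts, total_mapped_reads):
    """Reads per million: raw counts / total mapped reads x 1 million."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return {
        name: count * 1_000_000 / total_mapped_reads
        for name, count in raw_counts.items()
    }


# Hypothetical library with 2 million mapped reads.
library = {"famA": 50, "famB": 0}
print(rpm(library, 2_000_000))
```

The resulting per-family values for autonomous and non-autonomous families could then be passed to a rank-sum test from a statistics library of choice.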

Short variant calling

We called variants with GATK4 [62], following best practices for germline short variant discovery (https://gatk.broadinstitute.org/hc/en-us/sections/360007226651-Best-Practices-Workflows), as described in [28]. Briefly, we trimmed reads, removed adaptors, and filtered low quality bases and short reads (≤25 bp) using cutadapt v2.6 [63]. We aligned trimmed reads to the reference genome [26] with BWA-MEM v0.7.17 [64], marked duplicates with MarkDuplicatesSpark and ran HaplotypeCaller, generating GVCF files for each accession. To combine GVCF files, we ran GenomicsDBImport and GenotypeGVCFs successively for each scaffold, and then merged files with GatherVcfs, to obtain a multisample VCF file. Based on the distributions of quality parameters, we removed low-quality variants using VariantFiltration with specific parameters for SNPs (QD < 2.0 || SOR > 4.0 || FS > 60.0 || MQ < 20.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0) and other variants (QD < 2.0 || QUAL < 30.0 || FS > 200.0 || ReadPosRankSum < -20.0). We filtered variants with vcftools v0.1.16 [65], retaining only biallelic variants with at most 10% missing genotype calls, and Minor Allele Frequency (MAF) > 0.01. Finally, we imputed missing genotype calls with BEAGLE 5.1 [66], obtaining a complete multisample VCF file. All the code for short variant calling, filtering and imputation can be found on GitHub (https://github.com/Dario-Galanti/BinAC_varcalling).
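To make the logic of the two hard-filter expressions explicit, the sketch below re-expresses them in Python: a variant is removed as soon as any single condition is met. It is an illustration only and not a substitute for VariantFiltration. Treating a missing annotation as not failing its clause is an assumption of this sketch, and the function and variable names are hypothetical.

```python
# (annotation, comparison, threshold) triples copied from the filter expressions.
SNP_RULES = [
    ("QD", "<", 2.0), ("SOR", ">", 4.0), ("FS", ">", 60.0),
    ("MQ", "<", 20.0), ("MQRankSum", "<", -12.5), ("ReadPosRankSum", "<", -8.0),
]
OTHER_RULES = [
    ("QD", "<", 2.0), ("QUAL", "<", 30.0), ("FS", ">", 200.0),
    ("ReadPosRankSum", "<", -20.0),
]


def fails_hard_filter(annotations, is_snp):
    """Return True if any clause of the relevant expression is satisfied."""
    rules = SNP_RULES if is_snp else OTHER_RULES
    for key, op, threshold in rules:
        value = annotations.get(key)
        if value is None:
            continue  # assumption: a missing annotation does not fail the clause
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            return True
    return False


# The same hypothetical annotations fail as a SNP (FS > 60) but pass as an indel.
record = {"QD": 10.0, "FS": 100.0}
print(fails_hard_filter(record, is_snp=True))   # True
print(fails_hard_filter(record, is_snp=False))  # False
```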

For calculating site frequency spectra, we used all biallelic SNPs with Minor Allele Count (MAC) of at least two. To assess the population structure of our dataset, we pruned variants in strong LD using PLINK [67] with the parameters “--indep-pairwise 50 5 0.8” and then ran PCA to summarize the genetic variation among accessions. Due to the high divergence of the Armenian accessions from the rest, we ran separate PCAs with and without these accessions, to highlight the structure of the remaining populations (S4 Fig).

Lastly, we analyzed the genetic relatedness among accessions from different geographic regions by constructing a maximum likelihood tree using TREEMIX [68] with 2,500 bootstrap replicates, without considering migration flow and using two sister species, Eutrema salsugineum and Schrenkiella parvula, as outgroups. We merged all 2,500 independent TREEMIX runs and generated a consensus tree with the Phylip “consense” command (https://evolution.genetics.washington.edu/phylip/).

TE polymorphism calling

To identify TE insertion polymorphisms (TIPs), we used SPLITREADER [32] as described in [69]. We applied two custom steps (https://github.com/acontrerasg/Tarvense_transposon_dynamics). In short, we removed Helitron insertions, as they have been shown to have a high false positive ratio [32].

Next, we mapped short reads from the reference accession MN106 to the reference genome to identify regions of aberrant coverage, using sample SAMEA9464759 from ENA project PRJEB46635 [26]. We calculated read coverage (RC) in 100 bp windows, adjusted for GC content [70], and excluded windows with abnormal coverage, arbitrarily defined as threefold lower or higher than the genome-wide, GC-content-adjusted mean. Any TIPs in these regions, which corresponded to ~16% of the reference genome, were excluded from the final dataset. Lastly, we removed TIPs with >100 reads in the 500 bp upstream and/or downstream of the TIP, because this suggested aberrant structural variants in the sample that were not reflected in the reference. To calculate the variant frequency spectra of TIPs, we classified TIPs as shared between two or more accessions if their coordinates were identical. To estimate the age of insertions, we used this same classification and calculated the maximum pairwise divergence (number of SNPs) between each combination of two carriers in the 70 kb region around the insertion [14], simply using the number of private SNPs for singletons. We then extrapolated the most likely age based on the A. thaliana mutation rate [71], assuming one generation per year.
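The coverage mask described above can be illustrated as follows. This minimal Python sketch assumes the GC-adjusted coverage of consecutive 100 bp windows on one scaffold is already available as a list; the function names are hypothetical and the example values are invented for illustration.

```python
def abnormal_windows(coverage, fold=3.0):
    """Indices of windows with coverage more than `fold` below or above the mean."""
    if not coverage:
        return []
    mean = sum(coverage) / len(coverage)
    low, high = mean / fold, mean * fold
    return [i for i, c in enumerate(coverage) if c < low or c > high]


def tip_in_masked_window(position, masked, window=100):
    """True if a TIP at a 0-based position falls into a masked window."""
    return position // window in masked


# Ten hypothetical windows: one with a coverage dropout, one with a pile-up.
cov = [30, 30, 30, 30, 30, 30, 30, 30, 5, 125]
masked = set(abnormal_windows(cov))
print(sorted(masked))
print(tip_in_masked_window(850, masked))
```

In this sketch the mean is computed over the windows passed in, so a genome-wide mean would require concatenating all scaffolds or passing the mean in separately.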

To detect TIPs using SPLITREADER, a collection of TEs is required. We used a representative subset of the total number of TEs present in the T. arvense reference genome, generated with a custom script. As a selection criterion, we defined representatives according to the consensus TE sequence of each family and the five longest individual members of each family. If a family consisted of < 5 members, all members were used.
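The selection rule for representative sequences can be sketched as below. This is not the authors' custom script; it is a minimal Python illustration that assumes each family is given as a consensus identifier plus a dictionary of copy identifiers and lengths, all with hypothetical names.

```python
def select_representatives(consensus_id, members, n_longest=5):
    """Return the consensus plus the n longest copies of a TE family.

    `members` maps copy identifiers to lengths in bp. Families with fewer
    than n_longest copies contribute all of their copies.
    """
    # Sort by decreasing length; ties are broken by identifier for determinism.
    longest = sorted(members, key=lambda m: (-members[m], m))[:n_longest]
    return [consensus_id] + longest


# Hypothetical family with six annotated copies.
family = {"c1": 4200, "c2": 800, "c3": 5100, "c4": 4900, "c5": 300, "c6": 2500}
print(select_representatives("FAM.consensus", family))
```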

We visually inspected 2,790 TIPs spanning all analyzed TE superfamilies and all accessions using a visual browser. Over 70% of TIPs were deemed correct, which is in line with reports from other studies in A. thaliana [32] and tomato [72].

To further confirm our TIPs, we generated HiFi PacBio long reads for an Armenian accession (Ames32867/TA_AM_01_01_F3_CC0_M1_1). We stratified seeds at 4°C for one month and germinated them on soil. One month after germination, we subjected plants to 24 h of darkness prior to harvesting. We extracted high molecular weight (HMW) DNA as described [73] using 600 mg of ground rosette material. Using a gTUBE (Covaris), we sheared 10 μg of HMW DNA to an average fragment size of 24 kb and prepared two independent non-barcoded HiFi SMRTbell libraries using SMRTbell Express Template Prep Kit 2.0 (PacBio). We pooled the two libraries and performed size selection with a BluePippin (SageScience) instrument with a 10 kb cutoff in a 0.75% DF Marker S1 High Pass 6–10 kb v3 gel cassette (Biozym). We sequenced the library on a single SMRT Cell (30 hours movie time) with the Sequel II system (PacBio) using the Binding Kit 2.0. Using PacBio CCS in “--all” mode (https://ccs.how/), we generated HiFi reads (sum = 31 Gb, n = 1,633,975, average = 19 kb). We called structural variants (SVs) against the reference using Sniffles2 [74]. Of the TIPs called in this accession using short reads, 71% had a PacBio HiFi read-supported SV within 200 bp, in line with our visual assessment of TIP quality.
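The comparison between short-read TIPs and long-read SVs reduces to a proximity search. The following minimal Python sketch assumes both call sets have been reduced to (chromosome, position) pairs and that a distance of exactly 200 bp counts as a match; the function name and example coordinates are hypothetical.

```python
import bisect


def fraction_supported(tips, svs, max_dist=200):
    """Fraction of TIPs with at least one SV within max_dist bp on the same chromosome."""
    by_chrom = {}
    for chrom, pos in svs:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    supported = 0
    for chrom, pos in tips:
        positions = by_chrom.get(chrom, [])
        # First SV at or to the right of the left edge of the search interval.
        i = bisect.bisect_left(positions, pos - max_dist)
        if i < len(positions) and positions[i] <= pos + max_dist:
            supported += 1
    return supported / len(tips) if tips else 0.0


tips = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
svs = [("chr1", 1150), ("chr1", 5300), ("chr2", 100)]
print(fraction_supported(tips, svs))
```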

Using paired-end short read Illumina data, we also screened for TE absence polymorphisms (TAPs). First, we calculated the GC-corrected median read depth (RD) in genome-wide 10 bp bins for short-read data sets from all accessions and from two reference controls. For every annotated TE ≥ 300 bp, we extracted its corresponding RD bins for both the controls and a single focal sample and used a non-parametric test (Wilcoxon rank-sum) to compare the bins of the focal sample with the bins of both controls. If i) the annotated TE showed a significant difference in coverage between the focal accession and the mean of the controls, and ii) the median coverage of that TE showed at least a 10-fold reduction in the focal accession compared to the median coverage across all accessions, then the TE was considered absent in the focal accession. To exclude the possibility that our TAP calls were the result of major rearrangements in the vicinity of the TAP call, we calculated the coverage of the flanking regions of the TAPs and removed those with < 5X or > 50X mean coverage.
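The decision rule for a single TE in a single accession can be summarized in a short function. This minimal Python sketch assumes the rank-sum p-value has been computed elsewhere and is passed in; the significance level of 0.05 is an assumption, since no cutoff is stated in the text, and all names are hypothetical.

```python
from statistics import median


def is_tap(focal_bins, all_accession_median, p_value, flank_mean_cov,
           alpha=0.05, fold=10.0, flank_min=5.0, flank_max=50.0):
    """Apply the three TAP criteria to one annotated TE in one focal accession.

    focal_bins: GC-corrected read depth of the 10 bp bins covering the TE.
    alpha: assumed significance level (not given in the text).
    """
    significant = p_value < alpha
    # At least a 10-fold reduction relative to the all-accession median.
    reduced = median(focal_bins) * fold <= all_accession_median
    # Flanks must have plausible coverage to rule out larger rearrangements.
    flanks_ok = flank_min <= flank_mean_cov <= flank_max
    return significant and reduced and flanks_ok


# Hypothetical TE with almost no reads in the focal accession.
print(is_tap([0, 1, 0, 2, 1], all_accession_median=25.0,
             p_value=1e-6, flank_mean_cov=22.0))
```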

Genome-wide Association between TE polymorphisms and genomic regions

To detect genetic variants associated with variation in TE content, we ran GWA using the number of TIPs of different classes, orders and superfamilies as phenotypes. We used mixed models implemented in GEMMA [75], correcting for population structure with an identity-by-state (IBS) matrix. Starting from the complete VCF file obtained from variant calling, we used PLINK [67] to prune SNPs in strong LD (--indep-pairwise 50 5 0.8) and computed the IBS matrix. We tested for associations between TIP counts and all variants with MAF > 0.04 (SNPs and short INDELs). We log-transformed TIP counts so that the phenotype approximated a normal distribution. To quantify the potential effects of components of the epigenetic machinery on TE content, we calculated the enrichment of associations in the proximity of a custom list of genes with connections to epigenetic processes [28] for increasing cutoffs [34]. Briefly, we assigned an “a priori candidate” status to all variants within 20 kb of the genes from the list and calculated the expected frequency as the ratio of “a priori candidate” variants to total variants. We calculated enrichment for increasing -log(p) thresholds, comparing the fraction of significant variants that were a priori candidates (observed frequency) to the expected frequency. We further calculated the expected upper bound for the false discovery rate (FDR) as described in [34]. The code to run GWA and the described enrichment analysis is available on GitHub (https://github.com/Dario-Galanti/multipheno_GWAS/tree/main/gemmaGWAS).
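The enrichment calculation described above can be sketched as follows. This is a minimal Python illustration, not the code from the linked repository. It assumes each variant is represented by its -log(p) value and a flag marking a priori candidates, and that a variant counts as significant when its score is at or above the threshold. The FDR bound of [34] is not included.

```python
def enrichment_by_threshold(variants, thresholds):
    """Observed/expected frequency of a priori candidates per threshold.

    variants: list of (neg_log_p, is_candidate) tuples.
    Returns {threshold: enrichment}, or None where nothing is significant.
    """
    total = len(variants)
    n_candidates = sum(1 for _, cand in variants if cand)
    if total == 0 or n_candidates == 0:
        raise ValueError("need at least one variant and one a priori candidate")
    expected = n_candidates / total

    results = {}
    for t in thresholds:
        significant = [cand for score, cand in variants if score >= t]
        if not significant:
            results[t] = None
            continue
        observed = sum(significant) / len(significant)
        results[t] = observed / expected
    return results


# Eight hypothetical variants, three of them near genes from the candidate list.
variants = [(6.0, True), (5.5, False), (2.0, False), (1.0, False),
            (7.0, True), (0.5, True), (0.2, False), (0.1, False)]
print(enrichment_by_threshold(variants, [5, 6.5, 8]))
```

An enrichment above 1 at a given threshold indicates that a priori candidates are over-represented among the significant variants relative to their genome-wide share.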

DNA methylation around insertions

To investigate cytosine methylation in the proximity of TIPs, we leveraged Whole Genome Bisulfite Sequencing (WGBS) data from the European accessions, using multisample unionbed files [28]. To reduce technical noise, we first excluded singleton TIPs and TIPs within 2 kb of another TIP or within 1 kb of annotated TEs. We calculated average methylation of accessions with and without a focal TIP in 2 kb flanking regions. We then combined methylation values of all TIPs in 50 bp bins of the 2 kb flanking regions, averaging all positions within each bin. Finally, we calculated the moving average (arithmetic mean) of 3 bins to smooth the curves. The workflow was based on custom bash and python scripts available at https://github.com/acontrerasg/Tarvense_transposon_dynamics.
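The binning and smoothing steps can be illustrated with plain Python. This minimal sketch assumes methylation values are supplied as (offset from the insertion in bp, methylation level) pairs for one flank; the handling of empty bins and of the first and last bin, where fewer than three bins are available, is an assumption of the sketch and may differ from the scripts in the repository.

```python
def bin_means(offset_values, flank=2000, bin_size=50):
    """Average methylation per bin; bins without data are returned as None."""
    n_bins = flank // bin_size
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for offset, value in offset_values:
        if 0 <= offset < flank:
            b = offset // bin_size
            sums[b] += value
            counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


def moving_average(values, window=3):
    """Centered arithmetic mean over `window` bins, skipping None entries."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = [v for v in values[max(0, i - half): i + half + 1] if v is not None]
        smoothed.append(sum(chunk) / len(chunk) if chunk else None)
    return smoothed


# Hypothetical cytosines in the first 100 bp next to an insertion.
bins = bin_means([(0, 0.2), (10, 0.4), (60, 1.0)], flank=100, bin_size=50)
print(moving_average(bins))
```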

Intersection with genomic features and Gene Ontology enrichment analysis

To investigate the targeting behavior of different TE families or superfamilies, we counted TIPs in different genomic features with bedtools [58] and divided them by the total genome space covered by each feature to obtain relative insertion density. We turned to gene ontology (GO) enrichment analysis to characterize genes potentially affected by insertions, using all genes located within 2 kb of an insertion. Briefly, we extracted GO terms from the T. arvense annotation and integrated them with the terms from A. thaliana orthologs identified by OrthoFinder2 [76]. We assessed enrichment with clusterProfiler [77] and piped all terms with p value < 0.05 to REVIGO [78], using default parameters.
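The relative insertion density is a simple ratio of counts to feature space. The minimal Python sketch below assumes TIP counts per feature (for example, as obtained from a bedtools intersection) and the total base pairs covered by each feature are already known; the feature names and numbers are hypothetical.

```python
def relative_insertion_density(tip_counts, feature_bp):
    """TIPs per bp of genome space for each genomic feature."""
    return {
        feature: tip_counts.get(feature, 0) / bp
        for feature, bp in feature_bp.items()
        if bp > 0  # skip features that cover no sequence
    }


counts = {"exon": 10, "intron": 5}
space = {"exon": 1_000_000, "intron": 500_000, "intergenic": 2_000_000}
print(relative_insertion_density(counts, space))
```

Normalizing by feature size makes densities comparable between features that occupy very different fractions of the genome.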

Code availability

Code used for analysis and figures can be found at: https://github.com/acontrerasg/Tarvense_transposon_dynamics.

Supporting information

S1 Table. Summary statistics of previously annotated TEs for the T. arvense reference genome MN106-Ref.

https://doi.org/10.1371/journal.pgen.1011141.s001

(XLSX)

S2 Table. Lineages of LTR-TEs in the T. arvense genome MN106-Ref.

https://doi.org/10.1371/journal.pgen.1011141.s002

(XLSX)

S3 Table. List of datasets that were uploaded to the Zenodo repository: https://doi.org/10.5281/zenodo.10161730 (10.5281/zenodo.6372331).

https://doi.org/10.1371/journal.pgen.1011141.s003

(XLSX)

S1 Fig. Comparison of autonomous and non-autonomous TE families in T. arvense MN106-Ref.

(A) Absolute (left) and relative (right) fraction of autonomous and non-autonomous elements in each TE superfamily. (B) Comparison of the fraction of autonomous and non-autonomous elements in each TE superfamily (left). Size comparison of the TE copies according to their autonomy per superfamily (right). (C) Contribution of each superfamily and their autonomous/non-autonomous fraction to total genome size in Mb. (D) Distribution of size and copy number per LTR retrotransposon lineage. (E) TE expression in autonomous vs. non-autonomous TEs.

https://doi.org/10.1371/journal.pgen.1011141.s004

(TIF)

S2 Fig. SNP-based maximum likelihood tree of T. arvense populations.

Based on a model without migration, 2,500 bootstraps. Node weights represent bootstrap values. Outgroup species at the bottom.

https://doi.org/10.1371/journal.pgen.1011141.s005

(TIF)

S3 Fig. Frequency distribution of TIPs overlapping with annotated genes and TEs.

TIP allele frequencies near other TEs are significantly lower than near genes (Wilcoxon Rank Sum test, p < 2.22E-16).

https://doi.org/10.1371/journal.pgen.1011141.s006

(TIF)

S4 Fig. SNP-based PCA of a subset of T. arvense accessions.

The Armenian accessions, which are outliers in the PCA using all accessions (Fig 2), were excluded from this new PCA analysis, which shows how Chinese and European accessions cluster separately. We also observe part of the south Sweden accessions clustering far from the rest of the European accessions.

https://doi.org/10.1371/journal.pgen.1011141.s007

(TIF)

S5 Fig. PCA analysis of 279 individuals of T. arvense.

A presence/absence matrix of either TIPs (left) or TAPs (right) was used as input to calculate PCA. This result recapitulates the clustering pattern observed with the SNP-PCA.

https://doi.org/10.1371/journal.pgen.1011141.s008

(TIF)

S6 Fig. Genomic distribution of TIPs and TAPs along all seven chromosomes of T. arvense.

Color columns indicate the biogeographical population to which each accession belongs.

https://doi.org/10.1371/journal.pgen.1011141.s009

(TIF)

S7 Fig. Complete GWA results for TIP load.

Left: Manhattan plots for each TIP superfamily load. The genome-wide significance (red line) corresponds to a full Bonferroni correction, the suggestive line (blue) to a more generous hard threshold of –log(p) = 5. Genes next to top variants are labeled with names; blue font indicates genes with links to DNA methylation included in the enrichment analyses. Middle: Enrichment and expected FDR of genes with links to DNA methylation, for significance threshold increments [28,34]. Right: QQ plots of p-values.

https://doi.org/10.1371/journal.pgen.1011141.s010

(TIF)

S8 Fig. Zoom-in of GWA peaks with candidate genes highlighted.

The genome-wide significance (dotted red line) corresponds to a full Bonferroni correction. DNA methylation machinery genes used for the enrichment of a priori candidates are depicted in blue, other genes that might affect transposition in red. The putative knock-out SNP disrupting the function of BRAT1 is depicted in green.

https://doi.org/10.1371/journal.pgen.1011141.s011

(TIF)

S9 Fig. GWA results for genome-wide load of TIPs younger than 500,000 years.

Left: Manhattan plots for load of TIPs for each TE superfamily. The genome-wide significance (red line) corresponds to a full Bonferroni correction, the suggestive line (blue) to a more generous hard threshold of –log(p) = 5. Genes next to top variants are labeled with names; blue font indicates genes with links to DNA methylation included in the enrichment analyses. Middle: Enrichment and expected FDR of genes with links to DNA methylation, for significance threshold increments. Right: QQ plots of p-values.

https://doi.org/10.1371/journal.pgen.1011141.s012

(TIF)

S10 Fig. BLASTN hits of T. arvense TE families with model sizes > 4 kb against the NCBI NT database, June 2022 release.

We filtered the matches using the 80/80/80 rule, and further constrained matches to fulfill > 2kb length criteria. The x-axis denotes the number of species with at least 1 hit. Each family has at least one hit, namely T. arvense itself. TE families with more than 5 hits are highlighted. The number of TIPs in T. arvense populations is shown in parentheses for the highlighted families to indicate that there is no obvious correlation between mobility in T. arvense and phylogenetic conservation.

https://doi.org/10.1371/journal.pgen.1011141.s013

(TIF)

Acknowledgments

We thank Haim Ashkenazy, Wei Yuan and Gautam Shirsekar for technical advice, Christa Lanz for support during PacBio HiFi library preparation and Alejandra Duque-Jaramillo, Tess Renahan and Rebecca Schwab for comments on the manuscript. We also thank Kevin M. Dorn for sharing T. arvense seeds. For computing, we acknowledge Prof. Peter Stadler at the University of Leipzig and David Langenberger from ecSeq, for hosting the EpiDiverse servers, and the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen for managing the BinAC server.

References

  1. 1. Wells JN, Feschotte C. A Field Guide to Eukaryotic Transposable Elements. Annu Rev Genet. 2020. pmid:32955944
  2. 2. Tenaillon MI, Hollister JD, Gaut BS. A triptych of the evolution of plant transposable elements. Trends Plant Sci. 2010;15: 471–478. pmid:20541961
  3. 3. Wicker T, Gundlach H, Spannagl M, Uauy C, Borrill P, Ramírez-González RH, et al. Impact of transposable elements on genome structure and evolution in bread wheat. Genome Biol. 2018;19: 103. pmid:30115100
  4. 4. Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, et al. A unified classification system for eukaryotic transposable elements. Nat Rev Genet. 2007;8: 973–982. pmid:17984973
  5. 5. Neumann P, Novák P, Hoštáková N, Macas J. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob DNA. 2019;10: 1. pmid:30622655
  6. 6. Arkhipova IR. Using bioinformatic and phylogenetic approaches to classify transposable elements and understand their complex evolutionary histories. Mob DNA. 2017;8: 19. pmid:29225705
  7. 7. Zhang H, Lang Z, Zhu J-K. Dynamics and function of DNA methylation in plants. Nat Rev Mol Cell Biol. 2018;19: 489–506. pmid:29784956
  8. 8. Bouyer D, Kramdi A, Kassam M, Heese M, Schnittger A, Roudier F, et al. DNA methylation dynamics during early plant life. Genome Biol. 2017;18: 179. pmid:28942733
  9. 9. Sigman MJ, Slotkin RK. The First Rule of Plant Transposable Element Silencing: Location, Location, Location. Plant Cell. 2016;28: 304–313. pmid:26869697
  10. 10. Srikant T, Drost H-G. How stress facilitates phenotypic innovation through epigenetic diversity. Front Plant Sci. 2020;11: 606800. pmid:33519857
  11. 11. Pecinka A, Dinh HQ, Baubec T, Rosa M, Lettner N, Mittelsten Scheid O. Epigenetic regulation of repetitive elements is attenuated by prolonged heat stress in Arabidopsis. Plant Cell. 2010;22: 3118–3129. pmid:20876829
  12. 12. Cavrak VV, Lettner N, Jamge S, Kosarewicz A, Bayer LM, Mittelsten Scheid O. How a retrotransposon exploits the plant’s heat stress response for its activation. PLoS Genet. 2014;10: e1004115. pmid:24497839
  13. 13. Ito H, Gaubert H, Bucher E, Mirouze M, Vaillant I, Paszkowski J. An siRNA pathway prevents transgenerational retrotransposition in plants subjected to stress. Nature. 2011;472: 115–119. pmid:21399627
  14. 14. Baduel P, Leduque B, Ignace A, Gy I, Gil J Jr, Loudet O, et al. Genetic and environmental modulation of transposition shapes the evolutionary potential of Arabidopsis thaliana. Genome Biol. 2021;22: 138. pmid:33957946
  15. 15. Ou S, Collins T, Qiu Y, Seetharam AS, Menard CC, Manchanda N, et al. Differences in activity and stability drive transposable element variation in tropical and temperate maize. bioRxiv. 2022. p. 2022.10.09.511471.
  16. 16. Benoit M, Drost H-G, Catoni M, Gouil Q, Lopez-Gomollon S, Baulcombe DC, et al. Environmental and epigenetic regulation of Rider retrotransposons in tomato. bioRxiv. 2019. p. 517508. pmid:31525177
  17. 17. Esposito S, Barteri F, Casacuberta J, Mirouze M, Carputo D, Aversano R. LTR-TEs abundance, timing and mobility in Solanum commersonii and S. tuberosum genomes following cold-stress conditions. Planta. 2019;250: 1781–1787. pmid:31562541
  18. 18. Paszkowski J. Controlled activation of retrotransposition for plant breeding. Curr Opin Biotechnol. 2015;32: 200–206. pmid:25615932
  19. 19. McGinn M, Phippen WB, Chopra R, Bansal S, Jarvis BA, Phippen ME, et al. Molecular tools enabling pennycress (Thlaspi arvense) as a model plant and oilseed cash cover crop. Plant Biotechnol J. 2018. pmid:30230695
  20. 20. García Navarrete T, Arias C, Mukundi E, Alonso AP, Grotewold E. Natural variation and improved genome annotation of the emerging biofuel crop field pennycress (Thlaspi arvense). G3. 2022. pmid:35416986
  21. 21. Dorn KM, Fankhauser JD, Wyse DL, Marks MD. A draft genome of field pennycress (Thlaspi arvense) provides tools for the domestication of a new winter biofuel crop. DNA Res. 2015;22: 121–131. pmid:25632110
  22. 22. Hill J, Nelson E, Tilman D, Polasky S, Tiffany D. Environmental, economic, and energetic costs and benefits of biodiesel and ethanol biofuels. Proc Natl Acad Sci U S A. 2006;103: 11206–11210. pmid:16837571
  23. 23. Cubins JA, Wells MS, Frels K, Ott MA, Forcella F, Johnson GA, et al. Management of pennycress as a winter annual cash cover crop. A review. Agron Sustain Dev. 2019;39: 46.
  24. 24. Frels K, Chopra R, Dorn KM, Wyse DL, Marks MD, Anderson JA. Genetic Diversity of Field Pennycress (Thlaspi arvense) Reveals Untapped Variability and Paths Toward Selection for Domestication. Agronomy. 2019;9: 302.
  25. 25. Warwick SI, Francis A, Susko DJ. The biology of Canadian weeds. 9. Thlaspi arvense L. (updated). Can J Plant Sci. 2002;82: 803–823.
  26. 26. Nunn A, Rodríguez-Arévalo I, Tandukar Z, Frels K, Contreras-Garrido A, Carbonell-Bejerano P, et al. Chromosome-level Thlaspi arvense genome provides new tools for translational research and for a newly domesticated cash cover crop of the cooler climates. Plant Biotechnol J. 2022. pmid:34990041
  27. 27. Hu Y, Wu X, Jin G, Peng J, Leng R, Li L, et al. Rapid Genome Evolution and Adaptation of Thlaspi arvense Mediated by Recurrent RNA-Based and Tandem Gene Duplications. Front Plant Sci. 2021;12: 772655. pmid:35058947
  28. 28. Galanti D, Ramos-Cruz D, Nunn A, Rodríguez-Arévalo I, Scheepens JF, Becker C, et al. Genetic and environmental drivers of large-scale epigenetic variation in Thlaspi arvense. PLoS Genet. 2022;18: e1010452. pmid:36223399
  29. 29. Bourgeois Y, Boissinot S. On the Population Dynamics of Junk: A Review on the Population Genomics of Transposable Elements. Genes. 2019;10. pmid:31151307
  30. 30. Dubin MJ, Zhang P, Meng D, Remigereau M-S, Osborne EJ, Paolo Casale F, et al. DNA methylation in Arabidopsis has a genetic basis and shows evidence of local adaptation. Elife. 2015;4: e05255. pmid:25939354
  31. 31. Sasaki E, Gunis J, Reichardt-Gomez I, Nizhynska V, Nordborg M. Conditional GWAS of non-CG transposon methylation in Arabidopsis thaliana reveals major polymorphisms in five genes. PLoS Genet. 2022;18: e1010345. pmid:36084135
  32. 32. Quadrana L, Bortolini Silveira A, Mayhew GF, LeBlanc C, Martienssen RA, Jeddeloh JA, et al. The Arabidopsis thaliana mobilome and its impact at the species level. Elife. 2016;5: e15716. pmid:27258693
  33. 33. Erdmann RM, Picard CL. RNA-directed DNA Methylation. PLoS Genet. 2020;16: e1009034. pmid:33031395
  34. 34. Atwell S, Huang YS, Vilhjálmsson BJ, Willems G, Horton M, Li Y, et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature. 2010;465: 627–631. pmid:20336072
  35. 35. Pietzenuk B, Markus C, Gaubert H, Bagwan N, Merotto A, Bucher E, et al. Recurrent evolution of heat-responsiveness in Brassicaceae COPIA elements. Genome Biol. 2016;17: 209. pmid:27729060
  36. 36. Sayers EW, Bolton EE, Brister JR, Canese K, Chan J, Comeau DC, et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 2022;50: D20–D26. pmid:34850941
  37. 37. Krämer U. Planting molecular functions in an ecological context with Arabidopsis thaliana. Elife. 2015;4. pmid:25807084
  38. 38. Bourque G, Burns KH, Gehring M, Gorbunova V, Seluanov A, Hammell M, et al. Ten things you should know about transposable elements. Genome Biol. 2018;19: 199. pmid:30454069
  39. 39. Sultana T, Zamborlini A, Cristofari G, Lesage P. Integration site selection by retroviruses and transposable elements in eukaryotes. Nat Rev Genet. 2017;18: 292–308. pmid:28286338
  40. 40. Quadrana L, Etcheverry M, Gilly A, Caillieux E, Madoui M-A, Guy J, et al. Transposition favors the generation of large effect mutations that may facilitate rapid adaption. Nat Commun. 2019;10: 3421. pmid:31366887
  41. 41. Stritt C, Thieme M, Roulin AC. Rare transposable elements challenge the prevailing view of transposition dynamics in plants. Am J Bot. 2021;108: 1310–1314. pmid:34415576
  42. 42. Zhang S-J, Liu L, Yang R, Wang X. Genome Size Evolution Mediated by Gypsy Retrotransposons in Brassicaceae. Genomics Proteomics Bioinformatics. 2020;18: 321–332. pmid:33137519
  43. 43. Geng Y, Guan Y, Qiong L, Lu S, An M, Crabbe MJC, et al. Genomic analysis of field pennycress (Thlaspi arvense) provides insights into mechanisms of adaptation to high elevation. BMC Biol. 2021;19: 143. pmid:34294107
  44. 44. Kim S, Park M, Yeom S-I, Kim Y-M, Lee JM, Lee H-A, et al. Genome sequence of the hot pepper provides insights into the evolution of pungency in Capsicum species. Nat Genet. 2014;46: 270–278. pmid:24441736
  45. 45. Sasaki E, Kawakatsu T, Ecker JR, Nordborg M. Common alleles of CMT2 and NRPE1 are major determinants of CHH methylation variation in Arabidopsis thaliana. PLoS Genet. 2019;15: e1008492. pmid:31887137
  46. 46. Ichino L, Boone BA, Strauskulage L, Jake Harris C, Kaur G, Gladstone MA, et al. MBD5 and MBD6 couple DNA methylation to gene silencing through the J-domain protein SILENZIO. Science. 2021 [cited 4 Jun 2021]. pmid:34083448
  47. 47. Specchia V, Bozzetti MP. The Role of HSP90 in Preserving the Integrity of Genomes Against Transposons Is Evolutionarily Conserved. Cells. 2021;10. pmid:34064379
  48. 48. Cappucci U, Noro F, Casale AM, Fanti L, Berloco M, Alagia AA, et al. The Hsp70 chaperone is a major player in stress-induced transposable element activation. Proc Natl Acad Sci U S A. 2019;116: 17943–17950. pmid:31399546
  49. 49. Specchia V, Piacentini L, Tritto P, Fanti L, D’Alessandro R, Palumbo G, et al. Hsp90 prevents phenotypic variation by suppressing the mutagenic activity of transposons. Nature. 2010;463: 662–665. pmid:20062045
  50. 50. Volff J-N. Turning junk into gold: domestication of transposable elements and the creation of new genes in eukaryotes. Bioessays. 2006;28: 913–922. pmid:16937363
  51. 51. Jangam D, Feschotte C, Betrán E. Transposable Element Domestication As an Adaptation to Evolutionary Conflicts. Trends Genet. 2017;33: 817–831. pmid:28844698
  52. 52. Almeida MV, Vernaz G, Putman ALK, Miska EA. Taming transposable elements in vertebrates: from epigenetic silencing to domestication. Trends Genet. 2022;38: 529–553. pmid:35307201
  53. 53. Drost H-G, Sanchez DH. Becoming a Selfish Clan: Recombination Associated to Reverse-Transcription in LTR Retrotransposons. Genome Biol Evol. 2019;11: 3382–3392. pmid:31755923
  54. 54. Tsuchiya T, Eulgem T. An alternative polyadenylation mechanism coopted to the Arabidopsis RPP7 gene through intronic retrotransposon domestication. Proc Natl Acad Sci U S A. 2013;110: E3535–43. pmid:23940361
  55. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30: 772–780. pmid:23329690
  56. Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30: 1312–1313. pmid:24451623
  57. Drost H-G. LTRpred: de novo annotation of intact retrotransposons. JOSS. 2020;5: 2170.
  58. Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010;26: 841–842. pmid:20110278
  59. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10: 421. pmid:20003500
  60. Bailey TL, Johnson J, Grant CE, Noble WS. The MEME Suite. Nucleic Acids Res. 2015;43: W39–49. pmid:25953851
  61. Ansaloni F, Gualandi N, Esposito M, Gustincich S, Sanges R. TEspeX: consensus-specific quantification of transposable element expression preventing biases from exonized fragments. Bioinformatics. 2022;38: 4430–4433. pmid:35876845
  62. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20: 1297–1303. pmid:20644199
  63. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17: 10–12.
  64. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25: 1754–1760. pmid:19451168
  65. Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, et al. The variant call format and VCFtools. Bioinformatics. 2011;27: 2156–2158. pmid:21653522
  66. Browning BL, Zhou Y, Browning SR. A One-Penny Imputed Genome from Next-Generation Reference Panels. Am J Hum Genet. 2018;103: 338–348. pmid:30100085
  67. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81: 559–575. pmid:17701901
  68. Pickrell JK, Pritchard JK. Inference of population splits and mixtures from genome-wide allele frequency data. PLoS Genet. 2012;8: e1002967. pmid:23166502
  69. Baduel P, Quadrana L, Colot V. Efficient Detection of Transposable Element Insertion Polymorphisms Between Genomes Using Short-Read Sequencing Data. In: Cho J, editor. Plant Transposable Elements: Methods and Protocols. New York, NY: Springer US; 2021. pp. 157–169.
  70. Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009;19: 1586–1592. pmid:19657104
  71. Ossowski S, Schneeberger K, Lucas-Lledó JI, Warthmann N, Clark RM, Shaw RG, et al. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science. 2010;327: 92–94. pmid:20044577
  72. Domínguez M, Dugas E, Benchouaia M, Leduque B, Jiménez-Gómez JM, Colot V, et al. The impact of transposable elements on tomato diversity. Nat Commun. 2020;11: 4058. pmid:32792480
  73. Rabanal FA, Gräff M, Lanz C, Fritschi K, Llaca V, Lang M, et al. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 2022;50: 12309–12327. pmid:36453992
  74. Sedlazeck FJ, Rescheneder P, Smolka M, Fang H, Nattestad M, von Haeseler A, et al. Accurate detection of complex structural variations using single-molecule sequencing. Nat Methods. 2018;15: 461–468. pmid:29713083
  75. Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44: 821–824. pmid:22706312
  76. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20: 238. pmid:31727128
  77. Yu G, Wang L-G, Han Y, He Q-Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS. 2012;16: 284–287. pmid:22455463
  78. Supek F, Bošnjak M, Škunca N, Šmuc T. REVIGO summarizes and visualizes long lists of gene ontology terms. PLoS One. 2011;6: e21800. pmid:21789182