The complex architecture of plant transgene insertions

Over the last 35 years the soil bacterium Agrobacterium tumefaciens has been the workhorse tool for plant genome engineering. Replacement of native tumor-inducing (Ti) plasmid elements with customizable cassettes enabled insertion of a sequence of interest called Transfer DNA (T-DNA) into any plant genome. Although these T-DNA transfer mechanisms are well understood, detailed understanding of structure and epigenomic status of insertion events was limited by current technologies. To fill this gap, we analyzed transgenic Arabidopsis thaliana lines from three widely used collections (SALK, SAIL and WISC) with two single molecule technologies, optical genome mapping and nanopore sequencing. Optical maps for four randomly selected T-DNA lines revealed between one and seven insertions/rearrangements, and for the first time the actual length of individual transgene insertions from 27 to 236 kilobases. De novo nanopore sequencing-based genome assemblies for two segregating lines resolved T-DNA structures up to 36 kb into the insertions and revealed large-scale T-DNA associated translocations and exchange of chromosome arm ends. The multiple internally rearranged nature of T-DNA arrays made full assembly impossible, even with long nanopore reads. For the current TAIR10 reference genome, nanopore contigs corrected 83% of non-centromeric misassemblies. This unprecedented nucleotide-level definition of T-DNA insertions enabled the mapping of epigenome data. We identify variable small RNA transgene targeting and DNA methylation. SALK_059379 T-DNA insertions were enriched for 24nt siRNAs and contained dense cytosine DNA methylation. Transgene silencing via the RNA-directed DNA methylation pathway was confirmed by in planta assays. In contrast, SAIL_232 T-DNA insertions are predominantly targeted by 21/22nt siRNAs, with DNA methylation and silencing limited to a reporter, but not the resistance gene. With the emergence of genome editing technologies that rely on Agrobacterium for gene delivery, this study provides new insights into the structural impact of engineering plant genomes and demonstrates the utility of state-of-the-art long-range sequencing technologies to rapidly identify unanticipated genomic changes.


Introduction
Plant genome engineering using the soil microorganism Agrobacterium tumefaciens has revolutionized plant science and agriculture by enabling identification and testing of gene functions and providing a mechanism to equip plants with superior traits [1,2,3]. Transfer (T)-DNA insertional mutant projects have been conducted in important dicot and monocot models, and over 700,000 lines with gene affecting insertions have been generated in Arabidopsis thaliana ( Arabidopsis henceforth) alone (reviewed in O'Malley [4]). Targeted T-DNA sequencing approaches were conducted on approximately 325,000 of these lines to identify the disruptive transgene insertions and to link genotype with phenotype.
This wealth of information has been made available on this website http://signal.salk.edu/Source/AtTOME_Data_Source.html , which was iteratively updated since 2001, and which was accessed over 10 million times by 2017.
The Agrobacterium strains used in research projects are no longer harmful to the plant because the oncogenic elements of the tumor-inducing (Ti) plasmid have been replaced by a customizable cassette that includes a diverse set of in planta regulatory elements.
Agrobacterium -mediated transgene integration occurs through excision of the T-DNA strand between two imperfect terminal repeat sequences [5], the left border (LB) and right border (RB) [6], and translocation into the host cell nucleus (reviewed in Nester [7]). Hijacking the plant molecular machinery, the T-DNA is integrated at naturally occurring double strand breaks through annealing and repair at sites of microhomology [8,9]. While the exact mechanisms behind this error prone integration are poorly understood, it is known that insertion events generally occur at multiple locations throughout the genome [5,10]. T-DNA insertions also frequently contain the vector backbone, and occur as direct or inverted repeats of the T-DNA, resulting in large intra-and inter-chromosomal rearrangements [6,11,12,13,14,15,16,17,18,19,20].
Knowledge of insertion site, copy number and potential backbone insertions is crucial from scientific as well as regulatory perspectives. These aspects are routinely assessed using laborious Southern blotting, Thermal Asymmetric Interlaced (TAIL) PCR or targeted short-read sequencing [4]. One of the few attempts to gain deeper insight into an engineered genome was for transgenic Papaya , using a Sanger sequencing approach [14]. This work identified three insertion events, each less than 10 kilobases (kb) in length, however large repeat structures with high sequence identity are generally impossible to assemble using short-read sequences [21]. Current knowledge of structural genome changes and epigenetic stability in transgenic plants is limited. In this study we report on the genome structures of four Arabidopsis T-DNA floral-dip transformed plants, and are for the first time able to report the lengths of T-DNA insertions up to 236 kilobases. We have long-molecule evidence for further genome structural rearrangements including chromosomal translocations. To study such large insertions and rearrangements at the sequence level, we de novo assembled the genomes of two multi-insert lines (SALK_059379 and SAIL_232) and the reference accession Columbia-0 (Col-0) using Oxford Nanopore MinION (ONT) reads to very high contiguity. We present polished contigs that span chromosome arms and reveal the scrambled nature of T-DNA and vector backbone insertions and rearrangements in high detail. We subsequently tested transgene expression and functionality, and show differential epigenetic effects of the insertions on the transgenes between the two tested vector backbones. Small interfering RNA (siRNAs) species induced transgene silencing through the RNA-directed DNA methylation (RdDM) pathway of the entire T-DNA strand in the SALK-vector background, and in contrast the transgene remained active in the SAIL-line.
In this work, we report how technological advances enabled us to establish the size of T-DNA insertions in four independent mutant lines, and to assemble and analyse the genomes and epigenomes of two such lines to unprecedented detail.

Optical maps reveal size and structure of T-DNA insertions
To study both the location and size of T-DNA insertions into the genome, we randomly selected four transgenic Arabidopsis lines and generated optical genome maps using the Bionano Genomics Irys system (BNG; San Diego, CA) from pooled leaf tissue. These Columbia-based plant lines have previously been transformed by Agrobacterium using different binary vector constructs: SALK_059379 and SALK_075892 with pROK2 (T-DNA strand 4.4 kb; [22]) , SAIL_232 with pCSA110 ( T-DNA strand 7.1 kb; [23,24]), and WiscDsLox_449D11 with pDSLox ( T-DNA strand 9.1 kb; [19]). These segregating lines had no visual mutation-induced phenotype, and resembled their Columbia parent plants phenotypically. The fluorescently-labeled DNA molecules used for optical mapping had an average length of up to 288 kb which enables conservation of long-range information across transgene insertion sites (  TAIR10 reference genome sequence and observed up to   four structural variations, here T-DNA insertions, with sizes ranging from 27 kb to 236 kb (Fig. 1,   2, Table 1, and Supplementary Table 2). Although insertion sizes larger than the actual T-DNA cassette have been reported [6,11,12,17,19,20], the here measured length for insertions derived from short T-strands exceeded our expectations by~20-60 fold. Moreover, optical genome maps of the SAIL_232 line identified a total of seven genomic changes including three insertion events, one inversion involving~500 kb on chromosome 1, an inverted translocation on chromosome 3 that involves the exchange of two adjacent regions between 2.6-3.4Mb (847 kb) with 8.9-10.1 Mb (1,193 kb), as well as a swap between chromosome arm ends (chromosomes 3 and 5; Fig. 2). Previous short-read sequencing projects to identify insertion numbers (TDNAseq; http://signal.salk.edu/cgi-bin/tdnaexpress ) provided evidence for two (WiscDsLox_449D11), three (SALK_075892), four (SALK_059379) and five (SAIL_232) insertion sites, thus less than what we have observed through the optical genome mapping approach (Supplementary Table S5).

Assembly of highly contiguous genomes from MinION data
The number of insertions in SALK_059379 and the type of rearrangements observed in SAIL_232 sparked our interest in analyzing these genomes at greater (nucleotide) resolution.
We sequenced these engineered genomes, alongside the reference Col-0 (ABRC accession CS70000) using the ONT MinION device. We performed nanopore sequencing on each line using a single R9.4 flow cell (Table 1 and Supplementary Table 1) and assembled each genome using minimap/miniasm followed by three rounds of racon [25] and one round of Pilon [26]. We assembled the three lines into 40 contigs (Col-0; longest 16 Fig. 1). Chromosome arm spanning contigs covered telomere repeats and at least the first centromeric repeat, thus capturing 100% of the genic content. The remaining~3.9 Mb alignment-free reference regions were exclusively centromeric, and are also not necessarily correct in the TAIR10 reference. The Col-0 contigs covered over 99% of the TAIR10 reference, and the only discrepancies occur at the centromeres (Supplementary Fig. 1). High contiguity and quality of this assembly closed 38 of 46 previously identified non-centromeric misassemblies in the TAIR10 reference genome  Table 4).
One aim of creating near complete genome assemblies was to enable the structural resolution of transgene insertions at nucleotide level, rather than with genome scaffolding alone.
We next assessed contiguity at sites of T-DNA insertion and after alignment to optical maps ABRC accession CS873942). To test this hypothesis, we genotyped SAIL_232 alongside two randomly selected lines of the same collection (SAIL_59 and SAIL_107) using primers specific to the reference Col-0 CS7000 genome and the SAIL-inverted state (see Methods). PCR analysis confirmed that this "inversion" was common to all three independent SAIL-lines tested, and absent from Col-0 CS70000. Thus the event was not due to the T-DNA mutagenesis, rather is an example of the genomic "drift" among strains used as Columbia "reference".

SALK_059379 T-DNA insertions are conglomerates of T-strand and vector backbone
To annotate T-DNA insertion sites within the assembled genomes, we searched for pROK2 and pCSA110 plasmid vector sequence fragments within the assembled contigs (Supplementary Table 5). The assembled SALK_059379 genome contained three of the four optical map identified T-DNA insertions: SALK:chr1_5Mb, chr2_15Mb and chr2_18Mb (Table 1, The second and third insertions, SALK:chr1_5Mb (131 kb) and SALK:chr2_15Mb (207 kb), were partially assembled into contigs. SALK:chr1_5Mb contig_10 and contig_5 contain 25,132 bp and 33,736 bp T-DNA segments. Similarly, the extremely long chr2_15Mb insertion was partially contained within contig_4 and contig_7 (Fig. 1c), leaving an unassembled gap of 131 kb. The recovered structure of this insertion is noteworthy as it represents a conglomerate of intact T-DNA/backbone concatemers, as well as various breakpoints that introduced partial vector fragments with frequent changes of the insertion direction (Fig. 1c). Finally, the fourth insertion SALK_059379 (SALK:chr4_10Mb) was absent from the assembly. However, we While optical maps were successful in placing long "T-DNA only" sequence contigs into the large gaps (e.g. SALK:chr1_5Mb), the four "short" T-DNA-only sequence contigs of~50 kb or less did not contain sufficiently unique nicking pattern to confidently facilitate contig placement.

Large-scale rearrangements reshape the SAIL_232 genome
We next searched the SAIL_232 ONT contigs for pCSA110 vector fragments, and were able to confirm all optical map observed genome insertions (Table 1). This search additionally identified a further T-DNA insertion at chr5:20,476,509 (Supplementary Table 5) that was not assembled in the optical genome maps.
We found that chromosome 3 harbored two major rearrangements (Fig. 2). The first was a translocation of a 1.19 Mb fragment (chr3:8,902,305-10,095,395), which split at an internal T-DNA insertion at reference position 9,343,053 bp (Fig. 2a). The resulting two fragments were independently inverted prior to integration just before position chr3:2,586,494 ( Fig. 2 a,b). The second major change was a swap between the distal arms of chromosomes 3 and 5, which is supported by two optical maps as well as two ONT contigs ( Finally, the 81-kb insertion at SAIL:chr1_19Mb, consisting of four tandem T-DNA copies (~30 kb) followed by~20 kb of breakpoint interspersed T-DNA and vector backbone was partially assembled (50,676 bp) at the 5' end of contig_5 (Fig. 2e). Although not assembled as part of the flanking SAIL_contig_47, we recovered single ONT reads that contain T-DNA as well as genomic DNA fragments. In summary, our optical maps perfectly aligned with the sequenced-based assembly of the T-DNA insertion haplotype (Fig. 2e).

T-DNA Integration occurs independently from both double-strand break ends
While both sequenced lines share similar numbers of T-DNA insertion events, the genome of the SAIL_232 plant line underwent more significant changes to its architecture. All genome insertion sites began and ended with the left border (LB) of the T-DNA strand, providing evidence for independent transgene integration at both ends of the DNA double-strand break.
We did not recover any LB sequences at the chromosome/T-DNA junction, in line with literature reports that usually 73 -113 bp are missing from the LB sequence inwards [15,27]. Internal T-DNA sequence deletions were also seen at breakpoints within the insertion (Fig. 1c, Fig. 2c).
As observed for the SALK:chr2_18Mb chromosomal deletion ( Fig. 1e and Supplementary Fig.   2), we cannot exclude that long homologous stretches between the independently inserted T-DNA/vector backbone concatemers represent inverted repeats.

Transgenes are functional in SAIL-lines but are silenced in SALK-lines
We next wanted to assess the effects of T-DNA insertions on the epigenomic landscape.
The pROK2 T-DNA strand contains the kanamycin antibiotic-resistance gene nptII under control of the bacterial nopaline synthase promoter (NOSp) and terminator (NOSt), and an empty multiple cloning site under control of the widely used Cauliflower mosaic virus 35S (CaMV 35S) promoter and NOSt. The CaMV 35S constitutive overexpression promoter has previously been described to cause homology-dependent transcriptional gene silencing (TGS) in crosses with other mutant plants that already contain a CaMV 35S promoter driven transgene [28,29]. In germination assays, we confirmed that the kanamycin selective marker is not functional in SALK lines propagated for more than a few generations [4]. While~75% of SALK_059379 seed initially germinated on kanamycin-containing plates, these seedlings stopped growth with the first root and cotyledons emerging and we were not able to recover adult plants (Fig. 3a). The SAIL_232 pCSA110 T-DNA segment encodes the herbicide resistance gene bar (phosphinothricin acetyltransferase), under control of a mannopine synthase promoter [23]. In contrast to SALK_059379, we confirmed proper transgene function by applying herbicide to soil-germinated plants [30,31] (Fig. 3b). We corroborated these differential phenotypes by mapping RNAseq reads to the corresponding transformation plasmid and found that the SAIL_232 bar gene was expressed, while in SALK_059379 the nptII gene was not expressed, most likely due to epigenetic silencing (Fig. 3c,d).

Differential siRNA species define transgene silencing
We hypothesized that the observed plasmid dependent transgene expression is dictated by gene silencing via RNA-directed DNA methylation (RdDM) pathways [32]. To test the effects of small RNA and cytosine DNA methylation, we sequenced small RNA populations, as well as bisulfite converted whole genome DNA libraries for both lines. Our analyses identified abundant, yet divergent small RNA species (15-20nt, 21nt, 22nt, 23nt, 24nt) that mapped to genomic insertions of the nptII (pROK2) and bar (pCSA110) vector sequences in the SALK and SAIL lines, respectively (Fig. 3c,d). The nptII gene was highly targeted by 21/22nt short interfering (si)RNAs, as well as RdDM promoting 24nt siRNAs. The NOSp and duplicated NOSt were equally targeted by 22nt and 24nt siRNAs, although at lower numbers than the nptII gene itself (Fig. 3e). In contrast, the SAIL_232 bar gene and its promoter were targeted by high numbers of 21nt siRNAs, and lacking 24-mers (Fig. 3f). Interestingly, the pCSA110 encoded pollen specific promoter pLAT52, promoting GUS, was the only element highly targeted by 24nt siRNAs as well as high cytosine methylation levels in all sequence contexts (CG, CHG and CHH methylation, where H is A, C or T) in the SAIL background (Fig. 3d, f; Supplementary Table 6). In contrast, the entire pROK2 T-DNA region has continuously high cytosine methylation in all three contexts.
We found all elements to be fully CG and CHG methylated, and between 30% ( nptII ) and 71% (NOSt) of CHH's were methylated (Supplementary Table 7).
In summary, the GUS reporter in pCSA110 and all pROK2 transgenic elements are targeted by RdDM gene silencing pathway 24nt siRNAs. In contrast, the SAIL_232 pCSA110 bar gene was only targeted by 21/22nt siRNAs without any apparent effects on expression and the chemical resistance phenotype.
We were curious whether the 24nt siRNA and DNA methylation of the GUS gene transgenes can be only partially present, or present but differentially or not expressed (e.g. Gelvin [10], Peach and Velten [35]). The majority of studied cases enforce the conclusion that a transgene's destiny is determined by alterations to the genome structure at the site of insertion or the structure of the insertion itself, whereby both can induce epigenetic features with detrimental effects on the transgene function. In order to understand these structural effects better, these need to be resolved. However, all attempts were limited due to the short-read length of sequencing technologies [36] and the barely proven repetivity of concatenated identical T-DNA (transgene) and vector backbone insertions [13]. Yet, the long read sequences were still not sufficient to provide complete contiguity of the highly repetitive T-DNA insertions or the centromeres. Although optical maps had a shorter N50 compared with ONT sequence contigs, the biological triggers that "disrupt" contig extension are different for both applications. Around the T-DNA insertion sites, we were able to take advantage of this, as optical genome maps preserved contiguity over all insertions and rearrangements, and thus provided absolute proof for these events in contrast to e.g. split-read mapping of short Illumina reads [36]. In addition, these maps provided a valuable size measure for the transgene insertion events. In summary, the combination of optical genome maps and nanopore sequencing provided unprecedented insight not only into the size of T-DNA insertions, but also helped us to describe the scrambled nature of transgene insertions.
Long reads provide invaluable means to analyze repetitive genome regions with higher certainty. We showed that the T-DNA sequence at the insertion start and end is in all cases derived from the same T-strand region, around 60 nucleotides into the left border sequence.
Loss of the LB-end is most likely the result of endonuclease activity on the unprotected T-strand end [38]. This observation, together with the highly scrambled insertion structure, led us to hypothesize that the final T-DNA insertion derives from two independent T-DNA insertions at each end of the genomic breakpoint, subsequent daisy-chaining into the observed long T-DNA concatemers, and possible interaction among the many resulting homologous regions. Long homologous stretches between the independently inserted T-DNA/vector backbone concatemers represent inverted repeats which can lead to hairpin formation, which stalls the recombination fork and excision during DNA repair [39]. While internal breakpoints and the scrambled insertion pattern that we observed are in support of this, we have no data to confirm whether this happened before, during, or after integration (Fig. 1, 2, Supplementary Fig. 2). We Understanding why multi-insertions occur so frequently, and how these can be genetically or biochemically eliminated, will be of great interest also to the agricultural biotech industry.
Previous reports, spoken communication, and anecdotal experiences provided plenty of evidence that the antibiotic resistance marker genes, part of the transgene insertion cassette, are non-functional, most likely due to transcriptional or post-transcriptional DNA silencing [4]. To study these epigenetic features, we mapped short-read RNAseq, smallRNAseq, and bisulfite-converted DNA reads to each of the respective vector plasmids as reference. The high repetivity of the T-DNA insertions and the short reads utilized here made this necessary. Future studies can rely on sequencing a transcript in a single read, as well as extract DNA methylation information from kinetic information inherent to the long read nanopore data [40]. We are intrigued by the distinct mechanisms that result in silencing of the entire pROK2 element, but only partial silencing of the pCSA110 element. An increased GC content of a transgenic construct, above whole genome GC content, was shown to decrease transgene DNA methylation and siRNA production from the Agrobacterium -introduced transgene constructs [41]. In our study, all reporter genes had considerably higher GC levels ( bar 69%, nptII 60%, To avoid unfavorable outcomes, and make the best out of this invaluable resource, best practice measures such as those described in O'Malley and Barragan [43] should be taken into account. Last but not least, the T-DNA mutant seed stocks are highly heterozygous and still segregating, which was reflected in the optical maps and genome assemblies, where some insertions were missing, most likely due to lower allele frequency. In case of optical maps, we discovered a single instance of a phased genomic region, where two maps represent the wild type and the insertion allele (Fig. 2e). Although we did not observe phasing among the ONT contigs, we found several unassembled reads from T-DNA insertions absent from the ONT contigs, or that are not covered by optical maps and/or were not identified by TDNAseq (http://signal.salk.edu/cgi-bin/tdnaexpress). Existence of these insertions, such as SALK:chr3_20Mb and SALK:chr4_10Mb ( Supplementary Fig. 3), were confirmed through genotyping of the seed pool. Our utilization of these reads shows great potential for ONT sequencing for explorative low coverage sequencing of larger crop genomes to identify structural genome variation, that can then be further explored using more targeted methods.

Kanamycin resistant lines-SALK and Wisc
MS/MES media was prepared (1L NP H2O, 2.2g MS, 0.5g MES, pH5.7) and autoclave sterilized. Kanamycin (50ug/mL) was added to half of the media, and plates with and without the antibiotic were poured. Seeds from the lines SALK_059379 (homozygous), SAIL_232, WiscDsLox_449D11, and Col-0 CS70000 control were sterilized and spotted on the plates, which were placed in the same growth conditions as the plant material in soil. Survival rates were measured 5 days after seeds were spotted and growth was observed for further 14 days.

Finale resistant lines-SAIL
One flat each of Col-0 (control) and SAIL_232 were grown in soil as outlined in Plant Material . Plants were sprayed with Finale at three different concentrations (5mg/L, 10mg/L, 20mg/L) at 5, 14, and 21 days after germination [1,2].

GUS reporter staining-SALK and SAIL
Leaf and flower tissue was submerged in 1.5mL staining buffer (0.5mM ferrocyanide, 0.5mM ferricyanide, 0.5% Triton, 1mg/mL X-Gluc, 100mM Sodium Phosphate Buffer pH 7) and incubated at 37C overnight. The tissues were washed with decreasing concentrations of ethanol (100%, 80%, 65%, 50%). A light microscope was used to visualize the staining at 10x magnification for the leaves and 4x for the flowers.

MinION Nanopore Library Prep and Sequencing
5g of flash-frozen leaf tissue, pooled from the segregant seed stocks, was ground in liquid nitrogen and extracted with 20mL CTAB/Carlson lysis buffer (100mM Tris-HCl, 2% CTAB,
The ONT reads were also assembled with the CANU assembler [9]; The minimap/miniasm assemblies were however of higher contiguity and resolved longer stretches of the T-DNA insertions. In addition, the minimap/miniasm assemblies were computed within 4-6 hours, and in contrast the CANU assemblies took between one (Col-0) and four weeks (SAIL).
We hypothesize that the SAIL line inherent complex repeat structure caused the long compute time, while the reference Col-0 required only the expected time of one week. Also, we hypothesize that minimap/miniasm resolved the T-DNA structures more fully due to the fact it does not have a read correction step, which could lead to the collapsing of highly repetitive yet distinct T-DNA insertions.

Identification of individual T-DNA matching reads and annotation
Reads obtained via MinION sequencing were imported into Geneious R10 [10], along with the pROK2, and pCSA110 plasmid sequences. BLAST databases, were created for the complete set of SALK_059379 and SAIL_232 MinION reads, and their respective plasmid sequences were BLASTn searched against these databases. Reads with hits on the plasmid sequences were extracted, and BLASTn searched against the TAIR10 reference. NCBI megaBLAST [11] and Geneious 'map to reference' functions were used to annotate regions on each read corresponding to either TAIR10 or the plasmid sequence and the chromosomal regions were compared to the regions identified by TDNAseq and optical mapping to verify and refine T-DNA insertion start and stop coordinates.

Optical Genome Mapping
HMW DNA for optical genome mapping and nanopore sequencing was extracted as outlined in Kawakatsu [12]. Briefly, up to 5g of fresh mixed leaf and flower tissue (excluding chlorotic leaves and stems) pooled from the segregant seed stocks were homogenized in 50mL nuclei isolation buffer. Nuclei were separated from debris using a Percoll layer. Extracted nuclei were subsequently embedded in low melting agarose plugs and exposed to lysis buffer overnight. DNA was released by digesting the agarose with Agarase enzyme (Thermo Fisher Scientific).
Each A. thaliana T-DNA insertion line was run for up to 120 cycles on a single flow cell on the Irys platform (Bionano Genomics, San Diego, CA). Collected data was filtered (SNR = 2.75; min length 100kb) using IrysView 2.5.1 software, and assembled using default "small" parameters. We used this script to further determine misassemblies in the ONT genome assemblies. Known Col-0 mis-assemblies [12] were subtracted from the list of derived locations.

Aligning contigs to TAIR10
We aligned ONT contigs to the TAIR10 reference using the BioNano Genomics RefAligner as above (Supplementary Table 3). ONT contigs were in silico digested with Nt.BspQI and used as template in RefAligner. The alignment output was manually derived from the .xmap output file.

Genotyping of structural genome variations
Primer3 was used to create oligos (Supplementary Table 5

small RNA Libraries
We have extracted sRNA according to the protocol described in Vandivier [14] with modifications to the RNA extraction and size selection steps. In brief, 300mg flash-frozen leaf tissue from segregant plant pools was ground in liquid nitrogen and extracted with 700μL QIAzol lysis reagent (Qiagen, Hilden, D). The RNA was separated from the lysate using QIAshredder columns (Qiagen, Hilden, D) and purified with a ⅕ volume 24:1 chloroform:isoamyl alcohol Size selection and library prep were conducted exactly as described in Vandivier [14].
The RNA from 15-35bp was cut out of the gel and purified. Small RNA libraries were sequenced in duplicate as single end 150 bp in a multiplexed pool on two lanes of an Illumina HiSeq 2500.

Short read analysis
RNAseq and sRNA Illumina reads were adapter trimmed and mapped against the junction sequences as well as the corresponding plasmid sequences, using Bowtie2 and Geneious aligner (Geneious R10.2.3; custom sensitivity, iterate up to 5 times, zero mismatches or gaps per read, word length 14) at highest stringency. RNAseq reads were mapped against AtACT1 as control, and found this gene expressed. RNAseq reads were mapped against Pol V AT2G40030 as control, and found this gene not expressed. Resulting reads were extracted and re-mapped against the TAIR10 reference genome to ensure that no off-target reads mapped to the plasmid sequences. Because we analyzed various genomic insertions against a single reference, we did not normalize read mapping.
The MethylC-Seq data (paired-end) reads in FastQ format and the second pair reads were converted to their reverse complementary strand nucleotides. Then reads were aligned to the Arabidopsis thaliana reference genome (araport 11) and pCSA110/pROK2 genome. The Chloroplast genome was used for quality control (< 0.35% non-conversion rate). After mapping, overlapping bases in the paired-end reads were trimmed. The base calls per reference position on each strand were used to identify methylated cytosines at 1% FDR.

Data Availability
Raw