Common alleles of CMT2 and NRPE1 are major determinants of de novo DNA methylation variation in Arabidopsis thaliana

DNA cytosine methylation is an epigenetic mark associated with silencing of transposable elements (TEs) and heterochromatin formation. In plants, it occurs in three sequence contexts: CG, CHG, and CHH (where H is A, T, or C). The latter does not allow direct inheritance of methylation during DNA replication due to lack of symmetry, and methylation must therefore be re-established every cell generation. Genome-wide association studies (GWAS) have previously shown that CMT2 and NRPE1 are major determinants of genome-wide patterns of TE CHH-methylation. Here we instead focus on CHH-methylation of individual TEs and TE-families, allowing us to identify the pathways involved in CHH-methylation simply from natural variation and confirm the associations by comparing them with mutant phenotypes. Methylation at TEs targeted by the RNA-directed DNA methylation (RdDM) pathway is unaffected by CMT2 variation, but is strongly affected by variation at NRPE1, which is largely responsible for the longitudinal cline in this phenotype. In contrast, CMT2-targeted TEs are affected by both loci, which jointly explain 7.3% of the phenotypic variation (13.2% of total genetic effects). There is no longitudinal pattern for this phenotype, however, because the geographic patterns appear to compensate for each other in a pattern suggestive of stabilizing selection. Author Summary DNA methylation is a major component of transposon silencing, and essential for genomic integrity. Recent studies revealed large-scale geographic variation as well as the existence of major trans-acting polymorphisms that partly explained this variation. In this study, we re-analyze previously published data (The 1001 Epigenomes), focusing on de novo DNA methylation patterns of individual TEs and TE families rather than on genome-wide averages (as was done in previous studies). GWAS of the patterns reveals the underlying regulatory networks, and allowed us to comprehensively characterize trans-regulation of de novo DNA methylation and its role in the striking geographic pattern for this phenotype.

there are three types of DNA methylation contexts: CG and CHG, both of which are 5 symmetric, and CHH, which is not (H is A, T, or C). CG-methylation (mCG) and 6 CHG-methylation (mCHG) can be maintained in a semi-conservative manner during 7 DNA replication by DNA METHYLTRANSFERASE 1 (MET1 ) and 8 CHROMOMETHYLASE 3 (CMT3 ), respectively, whereas CHH-methylation (mCHH) 9 must be re-established every cell generation, presumably by one of two de novo 10 pathways, one involving CHROMOMETHYLASE 2 (CMT2 ), the other RNA-directed 11 DNA methylation (RdDM) [1][2][3]. CMT2 preferentially methylates heterochromatic 12 non-CG cytosines that are marked by H3 Lys9 (H3K9) di-and tri-methylation [4,5], 13 whiles RdDM involves small RNAs that recruit DOMAINS REARRANGED 14 METHYLTRANSFERASE 2 (DRM2 ) to target regions throughout the genome [6,7]. 15 These pathways thus have separate target sites [4] and establish the genome-wide DNA 16 methylation landscape in combination with maintenance and de-methylation pathways. 17 Natural variation for DNA methylation, superficially similar to DNA sequence 18 polymorphism, is abundant in Arabidopsis [8,9]. Although much of this variation likely 19 reflects local sequence variation (e.g. segregating TE insertions), recent studies have 20 revealed that a substantial part of the variation is controlled by trans-acting loci with 21 genome-wide effects [10][11][12]. Understanding these trans-regulators is essential for 22 understanding the genome-wide pattern of methylation variation, and could provide 23 important clues to the function of DNA methylation. 24 The present study builds on previous results to comprehensively characterize 25 trans-regulation of mCHH and its role in the striking geographic pattern for this 26 phenotype. We achieve this by looking for genotype-phenotype associations at the level 27 of individual TEs or TE families rather than genome-wide averages. As we shall see, 28 this makes a huge difference. 29

30
Major trans-regulators of mCHH levels 31 We first characterized average mCHH profiles of 303 TE families in each individual (S1 32 Tablel). Clustering analysis (based on the pattern across the 774 individuals) identified 33 four groups, with the largest two roughly corresponding to the TE families that were 34 previously shown to lose mCHH in RdDM and CMT2 pathway mutants ( Fig. 1 [13]). 35 The group corresponding to the RdDM pathway is mostly class I TEs and is enriched 36 with RC/Helitron and DNA/MuDR, whereas the group corresponding to the CMT2 37 pathway is dominated by class II TEs and is enriched with LTR/Copia and LTR/Gypsy 38 (note that targeting also strongly depends on element length and genome location; see RNA-polymerase V responsible for the RdDM pathway [7] and at chr1:17895231 in the 46 promoter region of ARGONAUTE1 (AGO1 ) recruits 21nt small RNA [14]. The 47 remaining seven peaks are associated with the group corresponding to the CMT2 48 pathway with very strong signals at chr4:10417744 and chr4:10422486 in the coding or 3' 49 region of CHROMOMETHYLASE2 (CMT2 ) [5] (Fig. 1).

50
The pattern of natural variation in mCHH is thus sufficient to outline pathways 51 previously painstakingly discovered using traditional genetic screens, as well as to 52 identify some of the major genes involved.

53
In addition to known genes, nine clear peaks suggest undescribed regulators of DNA 54 methylation (S2-3 Tables). For example, the peak at chr1:27261944 is in the promoter 55 region of a gene coding a DNAJ domain (At1g72416) that is a common component of 56 DNA methylation reader comlex [15], and the peak at chr4:9595111 is upstream of a 57 histone H3K4-specific methyltransferase SET7/9 family gene (At4g17080) implicated in 58 histone modification.

59
The four peaks that correspond to obvious a priori candidates are consistent with 60 previous results [11,12]. The peak near AGO1 identifies the same top SNP as 61 Kawakatsu et al [12, ] while the remaining three are in strong LD, but are much closer 62 to the respective candidate genes, presumably because the present analysis, focusing on 63 TE families rather than on average methylation levels, has higher resolution ( Fig. 2A). 64 Thus chr2:16719071 is in LD with the previously identified chr2:16724013 [12], but is in 65 the coding region of NRPE1, where it is LD with 12 non-synonymous polymorphisms 66 and a three bp in-frame indel in the RNA polymerase domain. Similarly, chr4:10417744 67 is in LD with chr4:10454628 CMT2b (see [11]), but inside the coding region of CMT2 68 and tagging two non-synonymous SNPs in the DNA methylase domain as well as a 69 twelve base-pair deletion in the first exon. Finally, chr4:10422486 is in LD with 70 chr4:10459127 CMT2a (see [11]), which is still outside the coding region, but 71 presumably in the regulatory region.

72
For clarity, we will refer to the newly identified associations as NRPE1', CMT2b' 73 and CMT2a'. The non-reference NRPE1' allele is associated with decreased mCHH 74 levels, whereas the non-reference alleles of CMT2b' and CMT2a' have negative and 75 positive effects, respectively, in agreement with previous results [11].

76
GWAS for mCHH levels of 9,228 individual TEs that are present in all 774 lines 77 showed a very similar pattern to GWAS for individual TE families. Although the AGO1 78 peak was much weaker, the signals at NRPE1', CMT2b', and CMT2a' remain strong 79 even at the level of individual TEs, with NRPE1' explaining 6.6% of the average mCHH 80 variation on RdDM-targeted TEs, whereas the two CMT2 alleles each explain about 4% 81 (total 6.4%) of the variation on CMT2-targeted TEs (Fig. 2B). Because the effect sizes 82 are so large, and because the genes target different chromosomal regions (NRPE1 83 mainly affects TEs in chromosome arms, whereas CMT2 targets TEs in pericentromeric 84 regions; see Fig. 2B), these polymorphisms contribute substantially to shaping the 85 genome-wide landscape of mCHH levels (Fig. 2C), and the remainder of this paper will 86 focus on them.  The heat map shows GWAS results for 303 TE families (each row is a family; columns indicate positions in genome; blue is more significant). The Manhattan plot on top shows integrated p-values from combining results across families (using X 2 -statics). The horizontal line in the Manhattan plot gives FDR 20% threshold, with significant associations shown in yellow (see Methods, S1 Fig.). Arrows indicate previously identified associations [11,12] also identified here. TE-families (rows) have been clustered based on average mCHH levels for 774 lines. The tip colors of the resulting tree correspond to TE superfamilies, and the superfamily composition for the four large clusters (Groups I-IV) is summarized by pie charts on the left. The greenish bars on the right show the reduction in mCHH levels of each TE family in drm1 drm2 (RdDM pathway) and cmt2 (CMT2 pathway) loss-of-function lines.   Causality of NRPE1 and CMT2 alleles 88 Identifying the causal polymorphisms underlying a GWAS peak is notoriously 89 difficult [17,18]. However, because the phenotypes associated with the polymorphisms 90 just described are so specific (multi-dimensional mCHH on hundreds or even thousands 91 of specific TEs throughout the genome), it is possible to confirm the causal involvement 92 of genes by comparison to mutant phenotypes. Specifically, we compared the estimated 93 allelic effects of NRPE1', CMT2b', and CMT2a' on 9,228 TEs with the effects of

4/17
CMT2b' Scatter plots show correlations of differential mCHH levels (DML) induced by alleles and mutants for each TE. DML for alleles was estimated as average differences of mCHH levels between lines carrying reference and non-reference alleles, whereas for mutants it was estimated between wild-type and the nrpe1-11 or cmt2 loss-of-function. Colors of dots in the scatter plots show the significance of the allelic effects as -log 10 p-value in GWAS. Density plots on Y and X-axis show distributions of the allelic effects for TEs. (B) Manhattan plots of cis peaks for CMT2 expression (n=665; leaf tissue under 21ºC) and effects of CMT2 alleles. Horizontal lines show the threshold (p-value 5% Bonferroni correction), and identified SNPs in meta-analysis for mCHH variation of TE families were labeled (FDR < 20% ). Boxplot shows CMT2 expression of lines carrying reference or CMT2a' alleles. * * * indicates p-value < 0.01 (Welch's t-test).
Furthermore, the phenotypic correlation between CMT2b' and cmt2 was much 101 stronger than the correlation between CMT2b' and any other gene knockout (Fig. 4), 102 effectively confirming the causal role of CMT2 -the alternative explanation would be 103 that the identified non-synonymous polymorphisms in CMT2 affect methylation via a 104 closely linked unidentified gene that mimics the highly specific phenotypic effects of 105 CMT2 much better than any of the 85 other analyzed genes in these well-studied 106 pathways. The correlation between the effects of CMT2a' and cmt2 is notably weaker, 107 perhaps because this allele affects expression like a moderate overexpressor (Figs 3B, 4). 108 This may be worth exploring further.
RdDM pathway CMT2 pathway Others  phenotype similar to knocking out NRPE1 [19], rather than by somehow regulating an 115 unknown member of the RdDM pathway (S5-6 Figs). The relative lack of specificity of 116 NRPE1 can also be seen from the comparison of natural alleles and knock-out 117 mutations. Whereas variation at CMT2 affects only a subset of TEs, variation at 118 NRPE1 affects all TEs, albeit to different extents (Fig. 3). 119 In summary, we feel confident that both the CMT2 and NRPE1 alleles involve 120 cis-acting polymorphisms that affect the phenotype via the corresponding genes. How 121 this is done is of course not clear, but we note again that both NRPE1' and CMT2b' 122 are associated with multiple non-synonymous SNPs, and that CMT2a' is associated 123 with increased CMT2 expression (Fig. 3). Note that the same analysis does not work  The natural alleles thus show similar patterns to knock-out mutants, albeit with some 127 notable differences. CMT2b' preferentially affects the same TEs as cmt2 regardless of 128 whether we consider the most significant or the largest effects (Figs 3-4, and S5).  This difference in specificity could be due to difference in target specificity between 135 these alleles, but may also be explained by the population dynamics of TEs, because it 136 turns out nrpe1-11 strongly affects TE-superfamilies that have relatively low frequency 137 in the population (like RathE3 cons and SINE; see S1,4 Tables (Figs 1, 2). However, as noted above, variation at NRPE1 clearly affects 144 methylation of CMT2-targeted TEs, whereas the converse is not true (Figs 3 and S9; 145 p-value < 0.01). 146 We examined the joint allelic effect of NRPE1' and CMT2b' or CMT2a' on mCHH 147 levels (Fig. 5A). mCHH levels on the RdDM-targeted TEs are primarily decided by  Table S3), mCHH levels of 154 CMT2-targeted TEs are well predicted by both loci (S9 Fig, S3 Table). The role of the 155 RdDM pathway on the establishment of DNA methylation in CMT2-targeted TEs has 156 been studied [4], and it appears to work on the edges of long TEs only (as shown in 157 cmt2; see Fig. 5A). In contrast, the effect of the natural allelic variation at NRPE1' 158 allele was observed over the entire TE, including the body. This suggests a qualitative 159 difference between the natural alleles and the knock-out allele.

160
In summary, genotypes of NRPE1 and CMT2 generate further diversity of mCHH 161 status over the genome. Given that both loci affect the pattern of methylation on 162 CMT2-targeted TEs, it is worth noting that the allele frequencies at these two loci are 163 strongly correlated. In particular, the genotype NRPE1'/CMT2b', which maximally 164

175
At NRPE1, the alternative allele is essentially only found in the east, and this is the 176 cause of a longitudinal cline in mCHH methylation on NRPE1 -targeted TEs (r 2 =0.024, 177 p-value=3.0e-05 vs r 2 =0.002, p-value=0.26 after regressing out NRPE1' ) even after  At CMT2, both alternative alleles are limited to Europe, where they appear 180 intermingled, but this causes no longitudinal cline for mCHH on CMT2-targeted TEs as 181 the alleles have opposite effects relative to the reference allele (r 2 =0.001; p-value=0.39; 182 Figs 6, S11). The distribution of NRPE1' alleles, which also affect CMT2-targeted TEs 183 contributes to the lack of a longitudinal pattern (p-value=0.03), consistent with the 184 observation above that selection may be acting to stabilize methylation. 185

186
In this paper we re-analyze the 1001 Epigenomes [12], focusing on mCHH patterns on 187 individual TEs and TE families rather than on genome-wide averages performed in 188 previous studies [10][11][12]. The advantages of this approach are evident. First, we were 189 able to identify the well-known RdDM and CMT2 pathways using only natural 190 variation data. We also identify several new associations, presumably corresponding to 191 previously unknown members of these extensively-studied pathways (Figs 1-2). Second, 192 the use of more fine-grained phenotypes allowed to refine previous associations, 193 identifying candidate causal polymorphisms in both CMT2 and NRPE1 (Fig. 2). 194 Furthermore, by comparing the genome-wide mCHH pattern with published data for 195 loss-of-function mutations [13,19], we were able to establish the causal involvement of 196 these genes (Figs 3-4).

197
In terms of molecular mechanisms, our results largely confirm and complement 198 previous studies [4,5,20]. The natural alleles of CMT2 and NRPE1 functionally behave 199 much like loss-of-function alleles, albeit with some interesting differences that deserve 200 further study. It is worth emphasizing in this context that these natural alleles have 201 large effects, and are amenable to experimental studies. Perhaps because we are dealing 202 with functional alleles, perhaps because we average over hundreds of lines, we get very 203 clear pictures of which TEs are targeted by which de novo pathway (Figs 1 and S2).

204
The mechanism underlying this targeting and the transition between pathways still 205 remains unclear despite considerable effort.

206
Analysis of active TEs might be informative from this point of view. The current 207 10/17 study is limited to TEs annotated in the reference genome, and present at high 208 frequency in the population. New TE insertions are likely to generate DNA methylation 209 diversity [21] but analysis of this will have to await long-read genome sequencing of 210 many lines, which will let us capture rare insertions, and study de novo 211 silencing. [12,22,23]. 212 Finally, we confirm the existence of major trans-acting polymorphisms affecting de 213 novo DNA methylation [11,12]. Based on currently available GWAS results, a genetic 214 architecture characterized small numbers of genes of large effect is highly unusual, and 215 is typically associated with adaptive polymorphism [24], but we can only speculate 216 about what the adaptive value of variation in TE methylation would be. However, the 217 idea of trade-offs and arms-races in a "genomic immune system" is not ridiculous -218 such mechanisms clearly maintain polymorphism in other defense systems [25]. The 219 geographic pattern observed here, with linkage disequilibrium between unlinked loci 220 (Fig. 6), is certainly suggestive of selection. Clustering of TE families was conducted based on average mCHH levels across 774 lines 236 (Fig. 1). The values were transformed into rank order per line and analyzed by hclust 237 function with R (https://www.r-project.org/), with the agglomerative method 238 'complete'. All other clustering analyses were conducted with raw values as described in 239 results using hclust function with default settings. Omnibus (GSE80744) and transformed it into the most normal by Box cox method.

247
GWAS was performed using a linear mixed model [28,29] by LIMIX [30] with a full 248 genome SNP matrix from the 1001 genome project (10,709,949 SNPs), and population 249 structure was corrected by IBS matrix. A linear model without correction of population 250 structure was conducted using lm function in R (https://www.r-project.org). SNPs that 251 satisfied minor allele frequency (MAF) > 5% were used for association studies. 252

11/17
Meta-analysis 253 To combine p-values for each SNP calculated by GWAS, we used Fisher's methods as 254 the following formula [31].
where p i is p-value for ith GWAS, and k is the number of GWAS in the 256 meta-analysis. X 2 follows X 2 distribution with 2k degrees of freedom. To optimize the 257 threshold, we calculated false discovery rate (FDR) using the enrichment test with a 258 priori gene list of 79 epigenetic regulators as described in [12]. The most significant 259 p-value within 15 kb of a gene (MAF > 5%) was assigned as the significance of the gene. 260 LD (r 2 ) were calculated between all pairs of SNPs satisfied with the FDR threshold 261 to determine independent GWAS peaks. In the case that a SNP pair has high LD 262 (r 2 >0.2), a SNP having lower X 2 scores were excluded from the list.     Table   451 GWAS results for average mCHH of TE families 452 S2 Table   453 Top SNPs associated with mCHH variation (FDR20)  Table   455 Genetic effects on mCHH variation 456 S4 Table   457 Compositions of TE superfamilies in Col-0 reference and it of common in 458 the population (n=774) 459