Genome-wide analysis of long noncoding RNAs, 24-nt siRNAs, DNA methylation and H3K27me3 marks in Brassica rapa

Long noncoding RNAs (lncRNAs) are RNA fragments that generally do not code for a protein but are involved in epigenetic gene regulation. In this study, lncRNAs of Brassica rapa were classified into long intergenic noncoding RNAs, natural antisense RNAs, and intronic noncoding RNAs and their expression analyzed in relation to genome-wide 24-nt small interfering RNAs (siRNAs), DNA methylation, and histone H3 lysine 27 trimethylation marks (H3K27me3). More than 65% of the lncRNAs analyzed consisted of one exon, and more than 55% overlapped with inverted repeat regions (IRRs). Overlap of lncRNAs with IRRs or genomic regions encoding for 24-nt siRNAs resulted in increased DNA methylation levels when both were present. LncRNA did not overlap greatly with H3K27me3 marks, but the expression level of intronic noncoding RNAs that did coincide with H3K27me3 marks was higher than without H3K27me3 marks. The Brassica genus comprises important vegetables and oil seed crops grown across the world. B. rapa is a diploid (AA genome) thought to be one of the ancestral species of both B. juncea (AABB genome) and B. napus (AACC) through genome merging (allotetrapolyploidization). Complex genome restructuring and epigenetic alterations are thought to be involved in these allotetrapolyploidization events. Comparison of lncRNAs between B. rapa and B. nigra, B. oleracea, B. juncea, and B. napus showed the highest conservation with B. oleracea. This study presents a comprehensive analysis of the epigenome structure of B. rapa at multi-epigenetic levels (siRNAs, DNA methylation, H3K27me3, and lncRNAs) and identified a suite of candidate lncRNAs that may be epigenetically regulated in the Brassica genus.


Introduction
The Brassica genus comprises vegetable and oil seed crops. The "Triangle of U" proposed the genomic relationship among six major species of the Brassica genus. Three allotetraploid species, each of which contains two complete diploid genomes derived from two different parental species, Brassica juncea L. ( [1]. Some species in the Brassica genus show morphological divergence (termed morphotype). B. rapa includes leafy vegetables such as Chinese cabbage (var. pekinensis), pak choi (var. chinensis), and komatsuna (var. perviridis), root vegetables including turnip (var. rapa), and oilseed crops (var. oleifera) [2]. The first whole genome sequence determined in the genus Brassica was that of B. rapa [3]. Later, whole genome sequences of B. oleracea, B. nigra, B. napus, and B. juncea were determined [4][5][6][7].
Previous studies have identified lncRNAs associated with different lines, tissues or environmental changes in Brassica [27][28][29][30][31][32][33]. However, there are few reports of the association between lncRNAs and epigenetic modifications in B. rapa. In this study, we examined the DNA methylation levels and H3K27me3 levels in the region covering lncRNAs and found an association between lncRNAs and inverted repeat regions (IRRs) or 24-nt siRNAs, and association between genes overlapping with lncRNAs and different epigenetic marks (DNA methylation and H3K27me3).
Seeds were surface sterilized and grown on agar solidified Murashige and Skoog (MS) plates with 1% (w/v) sucrose under long day (LD) condition (16h light / 8h dark) at 21˚C. Fourteenday first and second leaves of B. rapa and 19-day first and second leaves of B. oleracea were harvested for isolation of genomic DNA or total RNA. RJKB-T24 was used for RNA-sequencing (RNA-seq) for detection of lncRNAs. Briefly, after performing QC of the sequenced reads, putative mRNAs were identified by aligning the sequence reads to the B. rapa reference genome v1.5 using HISAT2 and then assembling transcripts with Stringtie. Assembled transcripts with a mapping code of 'u', indicating they are intergenic but not part of the annotated reference genome, were then compared to the Swis-sProt database using blastx (e-value 1e-10; [37]). Transcripts with hits were classified as putative mRNAs, while transcripts with no hits were classified as putative lincRNAs. A more detailed description of this method can be found in Shea et al, 2019 [27].

DNA extraction and PCR
Genomic DNA was isolated by the Cetyl trimethyl ammonium bromide (CTAB) method [38]. The PCR reaction was performed using the following conditions; 1 cycle of 94˚C for 3 min, 35 cycles of 94˚C for 30 s, 55˚C for 30 s, and 72˚C for 1 min, and final extension at 72˚C for 3 min. The PCR products were electrophoresed on 1.0% agarose gel. Primer sequences used in this study are shown in S1 Table. RNA extraction and RT-PCR Total RNA from the first and second leaves were isolated by SV Total RNA Isolation System (Promega Co., WI, USA). To analyze lncRNA expression, cDNA was synthesized from 500 ng total RNA using PrimeScript RT reagent Kit (Takara Bio., Shiga, JAPAN). The absence of genomic DNA contamination was confirmed by PCR using a control without reverse transcriptase. The PCR conditions were 94˚C for 2 min followed by 35 cycles of 94˚C for 30 s, 55˚C for 30 s, and 68˚C for 30 s. The primers used in this study are listed in S1 Table.

Detection of epigenetic states in lncRNA regions
To examine the epigenetic states (DNA methylation levels, H3K27me3, and 24-nt siRNA levels) of lncRNA coding regions in B. rapa, we used previous sequence data of whole genome bisulfite sequencing (WGBS) [39], chromatin immunoprecipitation sequencing (ChIP-seq) [40], and small RNA-sequencing (sRNA-seq) data [39], which were generated using samples from the same line, tissue, and developmental stages but harvested independently. Two biological replicates were used for all analyses.
The reads of WGBS were mapped to the B. rapa reference genome v.1.5 using Bowtie2 version 2.2.3 and Bismark v0.14.3, and data covering genomic regions encoding lncRNA in chromosomes A01 to A10 were extracted. In order to estimate the methylation levels of CG, CHG, and CHH (H is A, C, or T) contexts, the numbers of methylated and unmethylated reads were extracted for each cytosine position using bismark methylation extractor script with the paired-end parameter. The methylation level at each cytosine position was calculated by dividing the number of methylated cytosines (mC) reads by the total number of reads.
The reads of sRNA-seq were mapped to the B. rapa reference genome v.1.5 using Bowtie2 version 2.2.3. We classified the alignment reads by length, and the 24-nt aligned reads covering the lncRNA regions in chromosomes A01 to A10 were extracted.
The reads of ChIP-seq using anti-H3K27me3 (Millipore, 07-449) antibodies were mapped to the B. rapa reference genome v.1.5 using Bowtie2 version 2.2.3, and data covering genomic regions encoding for lncRNA in chromosomes A01 to A10 were extracted.

Comparison of putative mRNA and lncRNAs in B. rapa to other related species of Brassica
The putative mRNAs and lncRNAs were first compared to the reference genomes of B. nigra, B. oleracea, B. juncea, and B. napus by best-hit blastn (e-value 1e-10) to identify homologous regions in closely related Brassica species [6,7,41]. In order to parse the local High-scoring segment pair (HSP) alignments produced by blastn, genBlastA was used to produce a representative putative gene that is homologous to the query [42]. The parsed local alignments were then analyzed by a custom python script (available at http://www.github.com/danshea/ lncRNA) to examine the overall alignment length and computed coverage of the aligned homologous region to the putative mRNA or lncRNA query transcript sequence. These results were then imported into R and plotted to assess overall relative coverage of the sequences among the Brassica species using Simple plot and ggbio in the Bioconductor package [43].

Gene ontology (GO) analysis
The genes covering incRNAs or NATs were used for GO analysis. The incRNAs and NATs having DNA methylation, siRNAs, or H3K27me3 were identified and their corresponding genes were also used for GO analysis using agriGO [44] following the methods described by Shimizu et al, 2014 [45]. Statistical tests for enrichment of functional terms used the hypergeometric test and false discovery rate (FDR) correction for multiple testing to a level of 1% FDR.

Characterization of lncRNA in B. rapa
We identified 1,444 lincRNAs, 551 NATs, and 93 incRNAs using RNA-seq data of 14-day first and second leaves with and without four weeks of cold treatment [27]. In order to investigate the relationship between lncRNAs and epigenetic modifications or the species specificity of lncRNAs, we analyzed in more detail the RNA-seq data of the 14-day first and second leaves without cold treatment. There was no strong bias in the chromosomal distribution of expressed lncRNAs (S1 Fig). More than 65% of lncRNAs contained one exon (lincRNAs, 65.7%; NATs, 72.4%; incRNAs, 71.0%), whereas the proportion of mRNAs containing only one exon is 15.1% (Fig 1A). The mean transcript lengths of lincRNAs (725 nt) and incRNAs (779 nt) were shorter than that of NATs (1,271 nt) and mRNAs (1,305 nt) ( Fig 1B). About 40% of lincRNAs were located within 2 kb of the genic regions, and about 10% of lincRNAs were located more than 20kb from the genic region ( Fig 1C). In this study, we focused on lncRNAs mapped to the chromosomes A01 to A10 as previous WGBS, ChIP-seq, and sRNA-seq have omitted the placed scaffolds for their analyses [39,40]. 763 of 1,173 (65.0%) lincRNAs, 291 of 529 (55.0%) NATs, and 66 of 92 (71.7%) incRNAs overlapped with IRRs such as transposable elements (TEs) detected by RepeatMasker, suggesting that IRRs are the source of more than half of lncRNAs in B. rapa (Fig 2).

PLOS ONE
Analysis of long noncoding RNAs, siRNAs, DNA methylation and H3K27me in Brassica rapa

Relationship between lncRNAs and siRNA or DNA methylation
We have performed sRNA-seq using 14-day first and second leaves, which are identical developmental stages and tissues to previously analyzed samples [39], but independently harvested for this study. We identified the lncRNAs having perfect sequence identity to genomic regions encoding 24-nt siRNAs. 219 of 1,173 (18.7%) lincRNAs, 74 of 529 (14.0%) NATs, and 16 of 92 (17.4%) incRNAs in A01-A10 overlapped with unique-mapped genomic regions encoding 24-nt siRNAs, and more than 80% of the lncRNAs overlapping with genomic regions encoding 24-nt siRNAs were from regions that harbored IRRs (Fig 2). 24-nt siRNAs were mapped in a similar way to lncRNA and its 5' and 3' flanking regions; this mapping pattern is different from those of genic regions or IRRs (Fig 3).
We examined the whole genome DNA methylation state by WGBS using the same 14-day first and second leaves [39]. The average DNA methylation levels in regions covering lncRNAs were similar to those of the whole genome (S2 Fig). DNA methylation level in regions producing NATs was lower than those producing any of the three types of lncRNAs. Overlap of lncRNAs with IRRs or genomic regions encoding for 24-nt siRNAs was associated with increased DNA methylation levels and overlapping with both caused further increases in DNA methylation levels (Fig 4). We examined the influence on the lncRNA expression level when IRRs or siRNA clusters on the genome spanned the lncRNA regions. IRRs did not affect the expression levels of lncRNAs ( Fig 5). The expression levels of lncRNAs covered with siRNA clusters were higher than those not covered by siRNA clusters (Fig 5).

Relationship between lncRNAs and H3K27me3
We have examined the H3K27me3 distribution using the 14-day first and second leaf samples The expression level of incRNAs was higher when the encoding regions had H3K27me3 than without H3K27me3 (Fig 5). In NATs and lincRNAs, there was no difference of expression level with and without H3K27me3 on their encoding regions (Fig 5).

Characterization of un-annotated genes
In 2,052 lncRNAs mapped to intergenic regions of the genome, 608 transcripts had hits (evalue < 1.0e-10) against the Swissport database using BLASTX, indicating they could be unannotated genes. The expression levels of the putative mRNAs from these regions were similar

PLOS ONE
to that of mRNA [27]. Putative mRNAs tended to have fewer exons compared with annotated mRNAs (Fig 1A). The mean transcript length of putative mRNA genes was similar to that of mRNAs ( Fig 1B).
314 of 490 (64.1%) putative mRNAs in A01-A10 overlapped with IRRs (Fig 2) but the expression level of putative mRNAs overlapping with IRRs was similar to those not-overlapping with IRRs ( Fig 5). Mapping of 24-nt siRNAs and DNA methylation levels to genomic regions encoding putative mRNAs were similar to that of genic regions (Fig 3, S3 Fig). The average expression level of putative mRNAs with overlapping 24-nt siRNAs was higher than that without any overlap (Fig 5). The pattern of the average of DNA methylation level in the genomic region encoding putative mRNAs and their flanking regions was similar to that to the average of the genic regions (S3 Fig). Overlapping IRRs or 24-nt siRNAs resulted in an increase in DNA methylation levels and overlapping with both causes further increase in DNA methylation levels (Fig 4). 53 of 490 (10.8%) genomic regions corresponding to putative mRNAs had H3K27me3. H3K27me3 was enriched in the genomic regions encoding for lncRNAs, especially around the transcription start site, and this was similar to the pattern in the genic region (S5 Fig). The expression levels of putative mRNAs with corresponding genomic H3K27me3 were lower than those without H3K27me3, but there was no significant difference (Fig 5).

GO analysis of incRNAs and NATs and their relationship with epigenetic states
The paired genes overlapping with incRNA and NATs were subjected to GO analysis to assess whether there is any relationship between the gene ontology and the different epigenetic marks (Fig 6). GO category "transcription and DNA-dependent" and "metabolic process" were overrepresented for the incRNA data. For NATs, GO category "oxidation reduction", "transcription and DNA-dependent", and "carbohydrate metabolic process" were overrepresented.

PLOS ONE
Analysis of long noncoding RNAs, siRNAs, DNA methylation and H3K27me in Brassica rapa In relation to the paired genes that overlap with incRNA and DNA methylation, siRNA or H3K27me3 marks, DNA methylation was overrepresented for "transcription and DNAdependent". This indicates that the expression of these genes that overlap with incRNAs may be regulated through DNA methylation, and involved in transcription. Similarly, for NATs, H3K27me3 was overrepresented for "oxidation reduction" and "transcription and DNAdependent", genes overlapping with NATs and each epigenetic mark were overrepresented for "primary metabolic process" and "metabolic process", whereas siRNA was absent for "carbohydrate metabolic process". A detailed list of each GO analysis is presented in S2-S7 Tables.

Examination of the conservation of lncRNAs among the Brassica genus
Using the sequences of the 2,088 lncRNAs of B. rapa, a best-blast hit search against the B. nigra, B. oleracea, B. juncea, and B. napus reference genomes was conducted using GenBlastA (e-value = 1e-10) to examine the conservation of lncRNAs. B. rapa lncRNAs were most conserved in B. oleracea and moderate conservation was observed in B. nigra and B. napus. LncRNAs were less conserved in B. juncea, especially in the incRNAs (Fig 7).

PLOS ONE
Analysis of long noncoding RNAs, siRNAs, DNA methylation and H3K27me in Brassica rapa We selected twelve lncRNAs that showed high sequence similarity with the B. oleracea reference genome. We tested whether these lncRNA coding genomic sequences were conserved in three commercial cultivars of cabbage (B. oleracea) by PCR using genomic DNA as template. PCR amplification of genomic regions corresponding to all twelve lncRNAs was confirmed in all three B. oleracea lines. Next, we tested by RT-PCR whether these putative lncRNAs in B. oleracea are expressed in the first and second leaves (S6 Fig). Six of twelve lncRNAs were expressed in all three cultivars. One lncRNA was expressed in two of three lines, and one lncRNA was expressed in one of three lines. The remaining four lncRNAs were not expressed or were slightly expressed in all three lines (Table 1). We also examined the variation within B. rapa species using six lines. Seven of the twelve lncRNAs were expressed in all six lines, and the remaining five lncRNA showed line-specificity ( Table 1). The sequences of the twelve lncRNA are listed in (S8 Table). Fig 7. BLAST search of each type of lncRNA from B. rapa against B. nigra, B. oleracea, B. juncea, and B.

PLOS ONE
Analysis of long noncoding RNAs, siRNAs, DNA methylation and H3K27me in Brassica rapa

Discussion
We analyzed 2,088 lncRNAs in leaves of B. rapa; their characteristics such as number of exons, length of transcripts (except for NATs), and lower expression levels were similar to those reported in other plant species [46][47][48][49][50][51]. The total number of lncRNAs seems to be similar among species [52,53], but there tends to be low sequence conservation between different species [54]. Lower conservation of lncRNAs than mRNAs was observed between B. napus and its two ancestral species, B. rapa and B. oleracea [29,48], even though the hybridization was relatively recent event,~1,910-7, 180 years ago [55]. We also found a lower sequence conservation of B. rapa lncRNAs among other species of the genus Brassica (B. nigra, B. oleracea, B. juncea, and B. napus). In 12 selected highly conserved lncRNAs, we found conservation of the lncRNAs at both the sequence and transcriptional level between B. rapa and B. oleracea, but there was transcriptional variation within species agreeing with the IRR diversification of each species after the allotetraploidization of B. napus [56]. The analysis of sequence homology of lncRNAs in B. rapa with other members of the Brassica genus may deepen the understanding of the evolutionary dynamics of lncRNAs in the genus Brassica. We found over 55% of each lncRNAs (65.0% lincRNAs, 55.0% NATs, and 71.7% incRNAs) overlapped with IRRs including TEs throughout the B. rapa genome. TEs are considered a source of siRNAs [57], and lincRNAs identified in A. thaliana, rice, and maize, were associated with 22.2%, 49.7%, and 51.5% of TEs, respectively [58]. In this study, we found 18.7% lincR-NAs, 14.0% NATs, and 17.4% of incRNAs covered unique-mapped 24-nt siRNAs, and over 80% of lncRNAs covering the genomic regions encoding for 24-nt siRNAs overlapped with IRRs with perfect sequence match, suggesting that lncRNAs covering IRRs are also the likely source of siRNAs in B. rapa. For mRNA, the average expression levels of lncRNAs having 24-nt siRNAs was higher than that of lncRNAs not having 24-nt siRNAs [39]. LncRNAs and 24-nt siRNAs are known to be involved in the increase of DNA methylation, which agrees with reports of increased gene expression when DNA methylation overlaps with an exonic region [59]. The detailed mechanism is not clear, but our results also agrees that lncRNAs may be involved in this gene regulation mechanism.
The epigenetic functions of different lncRNAs have been identified in diverse organisms [21,22], but there is limited information in B. rapa. LncRNAs show a close association with DNA methylation of the genomic region encoding the lncRNA [60], but studies have focused on the model plant species, A. thaliana [11,61]. In this study, we found a similar level of DNA methylation in regions encoding lncRNA compared to previously described DNA methylation in the B. rapa genome [39]. Levels of DNA methylation can be positively regulated by siRNAs through the RdDM pathway [62], and Pol IV is important for biogenesis of 24-nt siRNAs [63]. Identification of long Pol IV-dependent transcripts is difficult because these transcripts can be rapidly processed to produce 24-nt siRNAs. There are several reports that detected Pol IVdependent siRNA-precursor transcripts in A. thaliana of different lengths [63][64][65]. Most Pol IV-dependent transcript regions overlapped with Pol IV-dependent siRNA loci, and CHH methylation depends on the production of Pol IV-dependent transcripts [65]. However, if Pol IV-dependent transcripts are short as 30-40 nucleotides [63,64], the lncRNAs that we identified are much longer and not likely to be Pol-IV dependent transcripts that act as siRNAprecursors.
Pol V transcripts accumulate at very low levels and have been difficult to identify. However, using RNA immunoprecipitation, Pol V-dependent lncRNAs were identified, and CHH methylation and 24-nt siRNA accumulation were shown to be restricted to Pol V transcribed regions [66]. In this study, about 13% of lncRNAs overlapped with genomic regions encoding 24-nt siRNAs, and the DNA methylation level of those regions was higher than the average. It has been reported that the median length of Pol V transcripts was 689 nt [66]. The length of the lncRNAs overlapping genomic regions encoding for 24-nt siRNAs identified in this study resembles the length of the Pol V transcript precursors.
The level of H3K27me3 plays a role in tissue specific gene expression [67]. LncRNAs are also considered to be involved in regulating histone modifications. Binding of lncRNAs to PRC2 has been observed in human/animal [68,69], and COLDAIR and COLDWRAP have been shown to recruit the PRC2 complex to FLC during vernalization in A. thaliana [13,14,70,71]. EMF2B is a component of the PRC2 complex and in rice, a mutant of emf2b was reported to lose H3K27me3 and derepress some of the lincRNAs, suggesting that expression of these lincRNAs is regulated by PRC2-mediated histone methylation of the region encoding the lncRNA [72]. We found that about 10-16% of lncRNAs overlapped with H3K27me3 regions, which is lower than the overlap with mRNAs. The pattern of H3K27me3 in lncRNA overlapping regions was similar to mRNAs; H3K27me3 accumulated in the body region of transcripts, especially around the transcription start site. However, there was no negative relationship between the presence of H3K27me3 and lncRNA expression levels, and incRNA coding regions having H3K27me3 marks showed higher expression levels than incRNA coding regions without H3K27me3 marks. H3K27me3 over lncRNA coding regions was also found in different tissues of maize, which was responsible for the regulation of tissue-specific lncRNAs expression [73]. Coincidentally, DNA methylation in intronic regions is known to influence splicing patterns [74], which may reflect epigenetic tissue specific gene regulation involving RdDM and H3K27me3. However, as our study focused on the analyses in a single tissue/developmental stage, we could not identify the lncRNAs that are transcriptionally silenced by H3K27me3, leading to the underestimation of lncRNAs with H3K27me3 marks. Further study will be required to examine the transcriptional regulation of lncRNAs by H3K27me3.
GO analysis of the genes that overlap with incRNA and NATs loci revealed that both were overrepresented in the "transcription and DNA-dependent" category. For incRNAs, all five genes were annotated as transcription factors/regulators but the function of these genes remains unknown without further analyses. However, for genes overlapping with NATs and DNA methylated loci, three of the six identified genes were annotated as beta-galactosidase that are known to be involved in pollen development and fertilization [75,76]. Interestingly, annotated genes enriched for the "transcription and DNA-dependent" category in "NATs" contained WRKY27 and WRKY62 genes that are related to pathogen defense [77,78]. Furthermore, a gene (MAF4) known to be involved in flowering regulation that is regulated by NATs in Arabidopsis [79] was also identified, indicating that this NAT may be regulated by DNA methylation and involved in flowering regulation in Brassica. However, further analyses such as deletion of the incRNA/NAT locus or manipulating their expression will be needed to understand the function of these lncRNAs in Brassica.
This study revealed that a small proportion of lncRNAs in B. rapa are conserved with other Brassica species. The majority of lncRNAs in B. rapa overlap with IRRs and there is some overlap with DNA methylation and 24-nt siRNAs, hinting at regulation of lncRNA expression through the RdDM pathway. Interestingly, some of the lncRNAs that overlapped with the genomic regions having H3K27me3 marks were more highly expressed, which was unexpected because of the known gene suppression activity of H3K27me3. This may indicate an unknown regulatory mechanism of lncRNAs by H3K27me3. However, the current study focuses on the genome-wide quantification of lncRNAs, 24-nt siRNAs, DNA methylation, and H3K27me3, and further quantitative analysis at the locus specific level may reveal a clearer relationship between these marks.
Epigenetic state of the genome (including DNA methylation and H3K27me3, expression of lncRNAs and 24-nt siRNAs) can dynamically change depending on the stage of development, biotic and abiotic stress conditions, and the expression of lncRNAs can also change depending on the tissue or environment and influence gene expression [80]. This study focused on the association of lncRNAs and epigenetic marks, but to understand the association of lncRNAs with agronomical traits, further exploration of lncRNAs under different tissues, environmental conditions, and cultivars (or varieties) will be needed.