Complex Interplay among DNA Modification, Noncoding RNA Expression and Protein-Coding RNA Expression in Salvia miltiorrhiza Chloroplast Genome

Salvia miltiorrhiza is one of the most widely used medicinal plants. As a first step to develop a chloroplast-based genetic engineering method for the over-production of active components from S. miltiorrhiza, we have analyzed the genome, transcriptome, and base modifications of the S. miltiorrhiza chloroplast. Total genomic DNA and RNA were extracted from fresh leaves and then subjected to strand-specific RNA-Seq and Single-Molecule Real-Time (SMRT) sequencing analyses. Mapping the RNA-Seq reads to the genome assembly allowed us to determine the relative expression levels of 80 protein-coding genes. In addition, we identified 19 polycistronic transcription units and 136 putative antisense and intergenic noncoding RNA (ncRNA) genes. Comparison of the abundance of protein-coding transcripts (cRNA) with and without overlapping antisense ncRNAs (asRNA) suggest that the presence of asRNA is associated with increased cRNA abundance (p<0.05). Using the SMRT Portal software (v1.3.2), 2687 potential DNA modification sites and two potential DNA modification motifs were predicted. The two motifs include a TATA box–like motif (CPGDMM1, “TATANNNATNA”), and an unknown motif (CPGDMM2 “WNYANTGAW”). Specifically, 35 of the 97 CPGDMM1 motifs (36.1%) and 91 of the 369 CPGDMM2 motifs (24.7%) were found to be significantly modified (p<0.01). Analysis of genes downstream of the CPGDMM1 motif revealed the significantly increased abundance of ncRNA genes that are less than 400 bp away from the significantly modified CPGDMM1motif (p<0.01). Taking together, the present study revealed a complex interplay among DNA modifications, ncRNA and cRNA expression in chloroplast genome.


Introduction
Salvia miltiorrhiza, also known as Chinese sage, tan shen, or danshen, is highly valued in traditional Chinese medicine for its roots or rhizomes [1]. S. miltiorrhiza extracts are the main components of two of the ten best-selling traditional Chinese medical products, each with a total sale of around 100 million US dollars. To date, more than 70 active components have been isolated from S. miltiorrhiza [2,3]. These active ingredients can be classified into two groups: lipid-soluble tanshinones and watersoluble phenolic acids. Compounds from both groups exhibit antioxidant, antitumor, anti-inflammatory, and antimicrobial properties [1].
Given the pharmacological and economic importance of the active components of S. miltiorrhiza, their overproduction remains an active research area. Recent studies have shown that chloroplasts are ideal hosts for the expression of genes related to the production of secondary metabolites [4] because these organelles support (1) the accumulation of large amounts of protein; (2) transgene stacking because of the presence of polycistron; (3) proper protein folding; (4) site-specific integration; and (5) maternal inheritance, which prevents the dissemination of transgenes through pollen. However, a detailed understanding of the chloroplast genome structure, gene contents, and gene expression regulation is a prerequisite for the development of effective chloroplast-based transgenic systems.
Chloroplasts are plant-specific organelles for photosynthesis, starch storage, nitrogen metabolism, sulfate reduction, fatty acid synthesis, DNA and RNA synthesis, and the synthesis of 50% of soluble protein in leaves [5]. By March 2014, 442 complete Viridiplantae chloroplast genomes are deposited in GenBank. Comprehensive reviews have been conducted on the (1) structure and biology of chloroplast genomes [6]; (2) RNA metabolism and translation regulation [7]; and (3) the coordination of gene expression between organellar and nuclear genomes [8]. Although the S. miltiorrhiza chloroplast genome has been recently assembled and analyzed [9], the expression profile, as well as the presence and functions of noncoding RNA (ncRNA), and DNA modifica-tions have not been studied in depth. Thus, the current study aims to clarify these issues.
ncRNAs are RNAs that do not encode proteins, but have multiple functions. ncRNAs can be classified into two broad categories: house-keeping and regulatory. House-keeping ncRNA, which include ribosomal, small nuclear, small nucleolar, and transfer RNA (tRNA), is constitutively expressed [10]. On the other hand, regulatory ncRNA are often expressed during specific developmental stages or under particular nutritional or environmental conditions [11]. One ncRNA subset, called natural antisense transcripts (NATs, also called asRNA), are endogenous RNA molecules with regions complementary to a sense transcript (cRNA). Through base-pairing with their targets, these asRNA elicit translational inactivation/activation, mRNA stabilization/ destabilization, or differential transcription termination [12,13].
Previous studies have identified dozens of ncRNA in chloroplasts [10,14]. A number of studies have elucidated the regulatory roles of chloroplast ncRNA. For the first example, an ndhB asRNA covers two editing sites of the ndhB gene and a group II intron splice acceptor site, and might play important role in RNA maturation or stability [14]. For the second example, two different antisense RNAs of psbT gene were found to form double-stranded RNA/RNA hybrids, which results in translational inactivation of the psbT mRNA. The hybrid was further hypothesized to provide protection against nucleolytic degradation of mRNA during photo-oxydative stress conditions [15]. For the third example, antisense RNA of rbcL and atpB genes form stable double-stranded molecules with sense strand transcripts and prevents their translation [16]. For the forth example, the expression of complementary RNA of RpoB-2 from chloroplast transgenes affects editing efficiency of transgene and other endogenous chloroplast transcripts [17]. For the last example, overexpression of AS5, a chloroplast-encoded asRNA complementary to the 5S rRNA in tobacco, destabilizes 5S rRNA and retards plant growth [18]. These findings suggest that the ncRNA in the chloroplast genome has various regulatory roles. Therefore, identifying and characterizing the ncRNA in the S. miltiorrhiza chloroplast genome would help reveal the functions of this important group of gene expression regulators.
DNA contains a large variety of functionally important modifications [19]. The most common modification, DNA methylation, involves the addition of a methyl group to cytosine or adenine DNA nucleotides by methyl transferase (MTase). The most common methylation types are N6-methyladenine (m6A), N4-methylcytosine (m4C), and 5-methylcytosine (m5C) [20]. In plants, DNA methylation is essential for growth and development, and it affects gene expression, genomic imprinting, heterochromatin assembly, and protection of the genome against migrating transposable elements [21,22]. It should be pointed out that most studies on the effect of DNA modifications on gene expression focus on the nuclear DNAs. For chloroplast DNAs, Ahlert et al. [23] observed the insensitivity of plastid gene expression to both adenine and cytosine methylation of the plastid DNA by expressing two cyanobacterial DNA methyltransferase genes from the tobacco plastid genome. However, whether or not these observations reflect the natural stages of DNA modifications in chloroplast genomes is unknown. As a result, identification of potential DNA modification sites in the S. miltiorrhiza chloroplast genome would provide valuable information on the epigenetic regulation of its gene expression.
In this study, we used the Single-Molecule Real-Time (SMRT) Sequencing and strand-specific RNA-Seq technologies to characterize in detail the S. miltiorrhiza chloroplast genome, transcriptome, and DNA modifications. Detailed analyses of DNA modifications as well as of ncRNA and cRNA expression suggest a complex interplay between these processes. The results of this study would provide valuable information for the construction of chloroplast-based transgenic systems, as well as lay the foundation for research on ncRNA and DNA modifications in regulating the expression of coding genes in chloroplasts from other organisms.

Materials and DNA sequencing
Fresh leaves from three plants of S. miltiorrhiza Bunge cv 99-3 were collected from the Institute of Medicinal Plant Development, Beijing, China. The leaves are pooled together and total DNA was extracted from the leaves using a plant genomic DNA kit (Tiangen, China). The genomic DNA of S. miltiorrhiza were isolated and subjected to DNA sequencing using SMRT sequencing, a parallelized single molecule DNA sequencing by synthesis technology developed by PacBio (Pacific Biosciences, Menlo Park, California, United States of America). The DNA sequencing is done on a SMRTcell containing hundreds thousand of zero-mode waveguide (ZMW) nano-pores. Inside each ZMW, a single active DNA polymerase with a single molecule of single stranded DNA template is immobilized to the bottom through which light can penetrate and create a visualization chamber. The nucleotide substrate is phosphor-linked to a fluorescence dye. As the DNA synthesis proceeds, the fluorescence signal is detected when a nucleotide is held by the DNA polymerase before the nucleotide being incorporated into the DNA strand, resulting in the DNA sequencing in real time. We choose SMRT technology for this study as it is the only technology that is currently capable to directly detect potential modifications on native DNA without PCR amplification [24], The DNA was used to construct two libraries with the insert sizes of 1 kb and 10 kb, which were subjected to DNA sequencing following the standard protocol provided by the manufacturer (PacBio).

Genome Assembly
To assemble the genome, we downloaded 277 chloroplast genomes from GenBank on October 2012. These sequences were used to search against PacBio reads for S. miltiorrhiza. Similar reads with an E value , 1e-5 were extracted and subjected to sequence assembly using Seqman (version 9.0). The genome sequence of Sesamum indicum, which has the highest overall sequence similarity with S. miltiorrhiza reads, was used as a reference to order the contigs. The gaps were filled by iteratively searching the PacBio S. miltiorrhiza reads using the contigs, extracting reads that share sequence similarities, extending the ends, adding the reads to the assembly, and then performing reassembly. After obtaining an initial assembly, we designed primers (Table S1 in File S1) to amplify the four junctions between the single-copy segments and the inverted repeats. We sequenced the PCR products using a BigDyeV3.1 Terminator Kit for 3730XL (Life Technologies) and assembled high-quality Sanger sequences into the initial assembly to obtain the final assembly using Seqman (DNASTAR, WI). The final assembly has been deposited into EMBL (accession number HF586694). Note that another chloroplast genome of S. miltiorrhiza was published during the duration of this study [9]. After comparison, only 41 base differences were found between the two genome assemblies; the general features are otherwise identical (results not shown).

RNA extraction and RNA-Seq analysis
Total RNA was extracted from fresh leaves of three S. miltiorrhiza plants using an RNAprep pure plant kit (Tiangen, China) separately following the manufacturer's recommendation. The three RNA samples were pooled and the cDNA was amplified using a random-priming protocol. Ribosomal RNA was removed from total RNA using MICROBExpress Bacterial mRNA Enrichment Kit (Ambion, AM1905). A RNA-Seq library was constructed following a strand-specific protocol (dUTP, Illumina) [25]. The resulting cDNA were cleaved into small fragments (300 bp to 400 bp) to construct sequencing libraries, followed by emulsion PCR and then sequenced according to the manufacturer's protocols. The resulting data can be found in GenBank with accession number SRA1045051. All reads were mapped to the chloroplast genome assembly using Tophat (v1.3.2) [26]. The polycistrons and ncRNA were identified manually based on the identification of contigs assembled from overlapping RNA-Seq reads along the S. miltiorrhiza genome assembly. The 59 and 39 ends were determined based on the very ends of each contig. The RNA-Seq data were visualized and quantified using SeqMonk (v0.23.0, Babraham Bioinformatics, UK) and Tablet (v1.12.12.05) [27]. SeqMonk is able to call the abundances/expression levels for given regions and output the abundance as number of Reads Per Million (RPM). Considering the low expression levels of ncRNA, no coverage cutoff was set. In another word, the minimal coverage is set as 1 for the RNA-Seq data analysis.

Strand-specific real-time quantitative PCR
Total RNA from leaves of three plants were extracted as described above. Strand-specific real-time quantitative PCR (ss-qPCR) was performed as described previously [28]. Briefly, twenty-four pairs of antisense ncRNA and cRNA were randomly selected, and ss-qPCR primers were designed (Table S2 in File S1). A sequence similarity search was conducted to ensure that the primer sequences would uniquely amplify the expected regions. Strand specific cDNA templates were synthesized with tagged primers to add 18 to 20 nucleotide tags unrelated to S. miltiorrhiza (Table S2 in File S1: cRNAtag; GCTAGCTTCAGCTAGG-CATC and mRNAtag; CCAGATCGTTCGAGTCGT) at the 59 end (Table S2 in File S1) [28]. Reverse transcription with the tagged primer was performed using a PrimeScript RT-PCR Kit (Takara). The PCR was performed as follows: a 10 ml mixture containing approximately 500 ng of total RNA, 1 ml of dNTP mixture (10 mM each), and 2 pmol of tagged primer was heated for 5 min at 65uC and then cooled to 4uC. The reaction mixture contained 4 ml of PrimeScript buffer (56), 0.5 ml of RNase inhibitor (40 U/ml), 0.5 ml of PrimeScript RTase, and 5 ml of RNase-free water. The RT reaction was conducted at 45uC for 30 min and terminated by heating at 70uC for 15 min. The 0.5 ml reverse-transcription product was amplified with an SYBR Premix Ex Taq kit (Takara) according to manufacturer's instructions. The qPCR experiment was performed on an ABI 7500 Fast instrument (Applied Biosystems). The cycle conditions were as follows: 95uC for 10 min, followed by 40 cycles of 95uC for 5 s, and 60uC for 34 s. The primers used are listed in Table S2 in File S1. Melting curve acquisition and analysis were performed immediately after amplification using default settings, which are 95uC for 15 sec, followed by 60uC for 1 min, and 95uC for 15 sec.

Validation of RNA-Seq results with ss-qPCR
To validate the strand specificity and expression abundance of ncRNA detected in our RNA-Seq data, three total RNA samples extracted from leaves of three S. miltiorrhiza plants were analyzed using ss-qPCR as described above. The expression quantification by qPCR was performed as described previously [29] and a cRNA coding for atpB gene was chosen as the internal control for each ss-qPCR experiment.
For RNA-Seq data, we first specified the start and end of the overlapping regions for each pair of ncRNA and cRNA. SeqMonk was used to calculate the RPMs for ncRNA and cRNA for the overlapping regions. Validation of differential gene expression between ncRNA and cRNA was performed as previously described [30,31]. Briely, the log fold change, or log ratio, of ncRNA and cRNA was calculated as log[(RPM for ncRNA)/ (RPM for cRNA)]. For ss-qPCR data, the log ratio for each technical replication was calculated as [ (40-ct) for ncRNA] -[(40ct) for cRNA]. The log ratios of the three technical replications were averaged to create the mean and standard deviation of log ratio for each biological replicate. The statistical significance between the RNA-Seq and ss-qPCR results was tested using onesample t test.

Detection of DNA modifications and identification of DNA modification motifs
The total DNA was sequenced using SMRT technology following the standard procedures as described above. Sequence preprocessing and DNA modification detection were performed using the SMRT Analysis pipeline (version 1.3.2). The DNA modification motifs were identified using MotifFinder (PacBio).

Statistical Analysis
Statistical analyses, including one-sample t test and analysis of variance, were conducted using the JMP software (v9.0, SAS, NC) and customer scripts written with Perl module Statistics::Distributions (v1.02).

Integrative analyses of the S. miltiorrhiza chloroplast
We used strand-specific RNA-Seq, and SMRT technologies ( Figure S1 in File S1) to characterize the genome, transcriptome, and DNA modifications of the S. miltiorrhiza chloroplast. An ideogram summarizing our results is shown in Figure 1, which include the following information: the putative DNA modification sites on the positive strands (a) and negative strands (b); the expression level of noncoding genes on the positive strands (c) and negative strand (d); the expression level of coding genes on the positive strands (e) and negative strand (f); and the identified polycistrons on the positive strands (g) and negative strands (h). This ideogram provides an overall picture of the transcriptome and DNA modifications in the S. miltiorrhiza chloroplast, which are described in detail below.

Analysis of chloroplast gene expression in leaves using RNA-Seq technology
We conducted strand-specific RNA-Seq analysis of the S. miltiorrhiza transcriptome to understand the gene expression landscape of the S. miltiorrhiza genome. Chloroplast genes are of prokaryotic origin. Therefore, we used a protocol designed to analyze prokaryotic transcriptome. We generated up to 53,849,024 read pairs. Using Tophat [26], 553,014 (1.0%) of the reads were mapped to the S. miltiorrhiza chloroplast genome. In a previous study, 133 genes have been identified in the S. miltiorrhiza chloroplast genome [9], including 131 predicted functional genes and 2 pseudogenes. The functional genes consist of 114 unique genes, including 80 protein-coding genes, 30 tRNA genes, and 4 rRNA molecules. For genes with more than one copy, we selected only one for further analysis. Except for 14 unique genes, all genes were expressed, including 80 protein-coding genes, 4 rRNA genes, and 16 tRNA genes (Table 1). Fourteen tRNA were not found expressed. One possibility is that the tRNA transcript lengths range from 70 bp to 92 bp. These transcripts were excluded from our library because only fragments that are 300 bp to 400 bp long were selected to construct the library. By contrast, 16 tRNA genes were expressed, 9 of which were embedded in polycistrons (Table 1)  These genes were quantified using the Seqmonk software. The results are shown in Table 1. The most highly expressed genes were the rRNA 23S and rRNA 16S, as well as psbA, psbB, psbC, psbD, and rbcL. This result is consistent with the involvement of rRNA in protein synthesis as well as with the participation of psb genes and rbcL in photosynthesis. Interestingly, two tRNA genes, namely, trnA-UGC and trnI-UUC, were highly expressed at levels comparable to those of rRNA genes and psb subunit genes. The trnA-UGC and trnI-GAU transcripts accounted for 60.60% and 38.18% of all tRNA transcripts, respectively, even though alanine and isoleucine only account for 2/30 (6.7%) of the amino acids in the S. miltiorrhiza genome. Detailed examination showed that these two tRNA genes are located in polycistron pc17, which contains the rRNA genes. Their expression levels are consistent with those of other genes in the same polycistron. Therefore, this region is a potential site for foreign gene insertion (see below).

Polycistron identification
Most genes from chloroplast genomes are transcribed in polycistrons [32]. We determined the polycistrons in the S. miltiorrhiza chloroplast genome based on the presence of a continuous transcript. In total, we found 19 polycistronic transcripts containing 71 genes, which include 58 coding-protein genes, 9 tRNA, and all 4 rRNA. The polycistron identity, its member genes, and the strand on which the gene is located are shown in Table 1. By comparing with the results reported by Yang et al. [33], the polycistron (pc1) was identified for the first time.
The transcripts for a number of genes (atpI, rpoC2) were not entirely covered by the RNA-Seq reads. However, based on the results of a previous study [33], we consider these cases as examples of biased coverage of the sequencing technologies. Moreover, the corresponding broken transcripts (polycistron pc3) were treated as a continuous polycistronic transcript.

ncRNA identification and classification
A recent study found unexpected diversity in the ncRNA in the chloroplast transcriptome [10]. We performed a systematic analysis to identify ncRNA in the chloroplast genome of S. miltiorrhiza. To be consistent, we defined coding RNA (cRNA) as any RNA transcripts transcribed from genes encoding proteins. We define non-coding RNA (ncRNA) as any RNA transcripts transcribed that do not encode proteins and have a length . = 100 bp. We identified 136 ncRNA transcripts based on these criteria ( Table 2). These ncRNAs are classified into two groups, namely, intergenic ncRNA ( Figure 2A) and antisense ncRNA (asRNA, Figure 2B), based on their positions relative to the cRNA genes. Intergenic ncRNA are those located between two transcripts or polycistrons and its distance to the start or end position of each transcript was at least 100 bp (Figure 2A). Three types, namely, A1, A2, and A3, were found. A special case of intergenic ncRNA expresses from both strands and was defined as bilateral ncRNA (A4, Figure 2A). asRNA are those located on the antisense strand of a cRNA gene and overlaps with cRNA gene regions. Depending on their location relative to the corresponding coding region of the cRNA gene, the asRNA were further divided into five types: 59 (B1), 39 (B2), coding (B3), span (B4) and intron (B5) asRNA ( Figure 2B). B1 includes those asRNA transcripts that overlap with the region upstream of the coding region of cRNA gene but do not cover the entire coding region. B2 are those overlap with the region downstream of the coding region of cRNA but do not cover the entire coding region. B3 are those either overlap with part of the coding region (B3A), or cover the entire coding region (B3B). B4 include those span two adjacent cRNA genes. B5 include those located within the intronic region of the cRNA gene. In total, 136 ncRNA transcripts have been identified (Table 2). Their sizes range from 135 bp to 1296 bp, with a mean of 263 bp. They include 18 intergenic and 118 antisense ncRNA. The numbers of ncRNA transcripts belonging to various subtypes are: A1: 1; A2: 4; A3: 9; A4: 4; B1: 9; B2: 9; B3A: 72; B3B: 6; B4: 11; B5: 11; respectively.
To determine if these ncRNA are conserved across various plant chloroplast genomes, we compared these sequences with those from A. thaliana. A previous study has identified 107 ncRNA in A. thaliana chloroplast genome [10]. In total, 45 S. miltiorrhiza ncRNA were found similar to 39 A. thaliana ncRNA using BLASTN with an a cutoff of E value ,1e-5, (Table 3). Four types of mappings were identified, which are 1:1, 1:n, n:1 and n:n respectively. ''1:1'' means that one S. miltiorrhiza ncRNA was mapped to one and only one A. thaliana ncRNA. ''1:n'' means that one S. miltiorrhiza ncRNA was mapped to multiple A. thaliana ncRNA. ''n:1'' and ''n;n'' mean that multiple S. miltiorrhiza ncRNAs were mapped to one or multiple A. thaliana ncRNAs respectively. In particular, there are five n:1 mappings between S. miltiorrhiza and A. thaliana ncRNA. Moreover, one S. miltiorrhiza ncRNA (nc99) was similar to two A. thaliana ncRNA (ndhF-59-1, ndhF-59-2), a 1:n mapping. Last, nc92 and nc93 of S. miltiorrhiza are similar to trnL-ndhBint and ndhB-39 of A. thaliana, a n:n mapping. However, we can not exclude that possibilities that some of these n:1,1:n and n:n mappings resulted from partial transcripts. These conserved ncRNA between S. miltiorrhiza and A. thaliana may have   Validation of RNA abundance and differential expression from RNA-Seq data using ss-qPCR We first calculated the correlation between the RNA abundance obtained from RNA-Seq and the three ss-qPCR results respectively. For these 24 pairs of ncRNA and cRNA, pearson correlation coefficients of 0.749, 0.764 and 0.779 were obtained (data not shown).
The ncRNA and its corresponding cRNA can be considered as two different conditions for differential gene expression study. We compared the log fold changes, also called log ratios, of the expression levels for ncRNA and cRNA obtained from RNA-Seq and ss-qPCR experiments. The calculation of the log ratio of ncRNA/cRNA from RNA-Seq and each of the ss-qPCR experiments is described in the method section. The log ratios are shown in Figure 3. For a pair of ncRNA and cRNA, the statistical significance between the RNA-Seq and the ss-qPCR results were tested using one-sample t test (Table S3 in File S1). A p value ,0.05 was considered significantly different. In total, 17 out of 24 pairs of ncRNA and cRNA (70%) did not exhibit significantly different gene expression between RNA-Seq and ss-qPCR results (Figure 3). The 7 pairs showed significantly different expression between the RNA-Seq and ss-qPCR results are indicated with ''*''. Overall, these findings demonstrate the reasonably good quality of our RNA-Seq data.

Interplay between asRNA and cRNA expression
To detect any interplay between cRNA,and asRNA, we determined the expression levels of cRNA and asRNA transcripts of all 80 protein-coding genes ( Table 2). It should be pointed out that, in Table 1, the start and end positions of cRNA are defined based on those of ncRNA. While in Table 1, the start and end positions of cRNA were defined based on those from gene annotations. At a result, the expression levels are slightly different for the same cRNA in Table 1 and 2. The expression of antisense strands was not detected in 27 of the 80 protein coding genes. A Students' t-test showed that the abundance of cRNA with asRNA expression was significantly higher than those of cRNA with no ncRNA expression (p,0.05) ( Figure 4A). This result suggests that asRNA expression promotes cRNA expression on the opposite strand (see the Discussion section for additional details). As previously described, asRNA can be divided into four types based on their relative positions to the coding transcripts. To detect any correlation between the asRNA locations and cRNA expression, we plotted the asRNA expression by their types ( Figure 4B). The average abundance of cRNA with 39 coding asRNA and intronic asRNA were similar to those shown in Figure 4A. However, the abundance of cRNA with 59 asRNA was considerably lower but the difference was not statistically significant, which indicates the possible repressive effects of asRNA on cRNA abundance.

Identification of DNA modification sites and DNA modification motifs
DNA modifications influence cellular functions. SMRT technology no only produces considerably longer reads than other second-generation DNA sequencing technologies but also provides detailed kinetic information of the DNA polymerase. As described earlier, SMRT sequencing technology detects the fluorescent signal during the period when the DNA polymerase is ''holding'' a nucleotide. These form fluorescent ''pulses'' as DNA polymerase continue incorporating nucleotides into the elongating DNA strand. The interpulse duration (IPD) is the time difference between two ''pulses''. IPD ratio is calculated as [IPD at a site at a condition under study]/[IPD at the same site at the control condition], The IPD changes significantly in the presence of DNA modifications. Based on the IPD ratio, the DNA modification status of the base can be inferred [20]. Using the DNA modification analysis module of SMRT Portal 1.3.2, we identified 2687 (p,0.01) putatively modified bases throughout the genome ( Figure 5A). Among these, 90.3% IPD ratio ranged from -2 to 2. We identified two DNA motifs using the motif analysis module, namely, ''TATANNNATNA'' ( Figure 5B) and ''WNYANT-GAW'' ( Figure 5C), whose percentage of putatively modified sites was 35/97 (36.1%) and 91/369 (24.7%), respectively (Table 4). Detailed information on the sites belonging to motifs 1 and 2 is included in Tables S4 and S5 in File S1, respectively.

Interplay between DNA modification and gene expression
The sequence of DMM1, ''TATANNNATNA,'' is similar to that of a TATA box frequently found in the core promoter of prokaryotes. Therefore, whether the modifications of these putative TATA box motifs would affect the expression of genes downstream of the modification sites is of significant interest. We extracted the transcripts corresponding to genes coding for protein, tRNA and ncRNA downstream of these DMM1 motifs. The downstream transcripts were further classified into three groups based on the distance between the modification motifs and the start of the downstream genes. For the group in which all transcripts were ncRNAs and the distance between DMM1 and the start of ncRNAs ranged from 100 bp to 500 bp (Table S6 in File S1), the presence of DNA modification was significantly correlated with higher ncRNA expression levels. The difference was significant with p,0.01 ( Figure 6 and Table S7 in File S1).

Prediction of sites for foreign gene insertion
We systematically analyzed the regions to select those that are suitable insertion sites. Our criteria included the following: (1) the region should have minimal effects on the expression of  (2) it has potential to be highly expressed. In particular, we selected locations that are in the intergenic regions of polycistronic sites and at which the relative expression levels of genes are high. The top-ranking regions are presented in Figure 7. Interestingly, these regions include trnI-GAU and trnA-UGC ( Figure 7C), which have been widely used in chloroplast-based genetic engineering [34]. The consistency between our prediction and those already used experimentally indicates that our selection criteria are reasonably well.

RNA-Seq technology
The current study takes advantage of two advanced technologies, RNA-Seq and SMRT, to identify asRNA and DNA modifications and then characterize their effects on gene expression in the S. miltiorrihiza chloroplast genome. RNA-Seq is a revolutionary tool for transcriptomic analysis [35]. It has several advantages over other transcriptomic methods such as tiling microarray and cDNA or EST sequencing [36], namely, independence from genomic sequence, low background noise, wide dynamic range to quantify gene expression level, and high throughput, among others. Moreover, strand-specific RNA-Seq methods provide information on the orientation of transcripts, Figure 3. Validation of RNA-Seq data using ss-qPCR. Specific primers were designed and used to measure the abundance of the transcripts that correspond to the positive and negative strands using ss-qPCR. The log ratio (X axis) of the abundance of ncRNA and cRNA transcripts was used to measure the differential expression of this locus. (*) indicates those transcripts whose log ratio derived from the RNA-Seq result is significant different from those of ss-qPCR (p,0.05). doi:10.1371/journal.pone.0099314.g003 which is valuable for transcriptome annotation particularly for regions with overlapping transcription from opposite directions [25]. As demonstrated in the current study, strand-specific RNA-Seq technology enables the detection of polycistrons and asRNA from chloroplast transcriptome, which is essential for interpreting the functional elements of chloroplast genomes.

SMRT technology
SMRT is a novel technology for detecting DNA methylation while simultaneously determining the context of the corresponding DNA sequence [20]. Before the development of SMRT technology, the most commonly used methods for detecting DNA modification are largely limited to bulk methods such as chromatography, mass spectroscopy, electrochemistry, radioactive labeling, immunochemical assays, and sensitivity to restriction enzymes [20]. However, these approaches suffer from low resolution and an inability to identify the precise sequence context of the methylation site(s). The SMRT technology allows the collection of kinetic data for the enzyme during the incorporation of each dNTP into the DNA strand. Significant changes in kinetic parameters such as IPD ratio should be observed when the DNA polymerase encounters m6A, m5C, or 5-hmC on the template strand. These distinct kinetic signatures allow the identification of the type and position of the base modification in the DNA template [20]. The SMRT technology is likely to enhance DNA  modification studies on samples that were not previously accessible for this type of research, such as organelle genomes.

Gene expression regulation by ncRNA
Several mechanisms have been proposed to explain the regulatory role of ncRNA, including transcription itself (transcriptional interference), as well as the formation of duplexes of RNA-DNA (chromatin remodeling) and RNA-RNA (dsRNA-dependent RNA formation) [37]. Under the first model, the abundance of natural sense-antisense transcript (SAT) pairs was found to be inversely correlated and antisense transcription leads to the downregulation of the corresponding sense transcript. Under the second model, the antisense transcript recruits histone-modifying enzymes that alter the chromatin structure, resulting in sensetranscript repression. Consistently, sense transcription is induced in the absence of antisense transcripts. Under the last model, double-stranded RNA formation spans an intron, which inhibits splicing and leads to a mature transcript with a retained intron.
Although the majority of studies on asRNA focused on those from nuclear genomes, some studies focused on asRNA in the chloroplast genome. For example, overexpression of a natural chloroplast-encoded antisense RNA (AS5) in tobacco destabilizes 5S rRNA and retards plant growth, possibly because AS5 prevents the accumulation of misprocessed 5S rRNA and controls its stoichiometry [18]. The majority of previous studies showed that asRNA expression is negatively correlated with that of the cRNA. However, positive correlation has also been found between asRNA and cRNA in studies on human breast epithelium and mammals [38,39].
In our study, we showed that at the genome scale, the cRNA expression levels in SATs are significantly higher than those not in SAT (p,0.05) ( Figure 4A). In other words, cRNA expression is positively correlated with asRNA expression. One likely mechanism is that asRNA expression might open up the chromatin and increase the accessibility of the corresponding region to the transcriptional machinery for cRNA, causing higher expression of the corresponding cRNA. Nevertheless, we cannot exclude the possibility that cRNA expression causes the higher asRNA expression in the same manner. Additional studies are in need to test these hypotheses.

Gene expression regulation by DNA modification
Previous studies indicated that gene transcription and DNA methylation are closely interwoven processes [40]. In mammalian genomes, DNA methylation is generally associated with silent CpG island promoters. However, the majority of CpG island promoters remain methyl-free regardless of their expression status [41]. In human testicular cancer, the addition of methylation inhibitors leads to downregulation of ncRNA expression [42]. In plants, a spontaneous mutation causing a single nucleotide polymorphism (SNP) between wild-type and mutant genes alter the secondary structure of LDMAR, an lncRNA gene. This change increases methylation in the putative promoter region of LDMAR, which reduces LDMAR transcription specifically under long-day conditions [43].
Although the expression regulation of chloroplast genes has been intensively studied [10], the presence of ncRNA and DNA modifications and their interaction with the expression of cRNA remain unclear. In the current study, we use SMRT technology and detected possible base modifications in chloroplast genomes. Furthermore, we identified two motifs that are associated with base modifications. As the first motif is similar to the TATA box in sequence, we hypothesize that it might regulate the expression of downstream genes as a transcriptional factor binding site. After classifying the downstream genes by gene type and their distance to the putative TATA box, the modification was found to be associated with significantly higher expression of ncRNA but not those protein-coding and tRNA genes. One explanation is that proteins and tRNA-coding genes might be regulated by specific elements such as promoters and enhancers, and that base modification plays a limited role in their expression regulation.  However, ncRNA expression regulation is less specific, and the modification of its upstream TATA box-like motif leads to significant differences in expression. Contrary to the general thoughts that DNA methylation downregulate gene expression, we found that CPGDMM1 methylation is correlated with increased ncRNA expression. These observations may reflect the unique mechanisms present in chloroplast genomes. Further investigation is needed to confirm these associations and to elucidate possible mechanisms.

Chloroplast based genetic engineering
Chloroplast vector systems have several advantages and consequently have multiple biotechnological applications [4]. One of our rationales for this study is to lay the foundation for designing efficient, chloroplast-based transformation vectors for S. miltiorrhiza. Chloroplast genomes are most abundant in the leaves. Therefore, focusing on the chloroplast genome characteristics in the leaves is reasonable. After obtaining a detailed expression map of S. miltiorrhiza chloroplasts in the leaves, we systematically predict regions that are potential insertion sites for foreign genes. Two of them have already been commonly used as sites for foreign gene insertion. Furthermore, the region between psaA and psaB on polycistron pc8 and that between ndhK and ndhC on polycistron pc9 ( Figure 7) were also identified as the most preferred insertion sites. These two sites have not been used previously and represent novel insertion site for foreign genes. An extensive validation of these sites for their application in chloroplast-based genetic engineering research will be the subject of a separate study.

Technical limitations
Second-and third-generation DNA sequencing technologies have made feasible the detection of ncRNA and DNA modifications at the genome scale. However, formidable experimental difficulties are inherent to antisense RNA and DNA modification research. First, most asRNA transcripts are expressed at significantly low levels and are thus difficult to be validated using classical technologies such as northern blot analysis and in situ hybridization. Second, the intricate relationship between sense and antisense transcripts means that experimental perturbation of one transcript inevitably interferes with the expression of the other transcript. Consequently, determining the biological functions of the transcripts by knocking-in and knocking-out the transcript alone is complicated. Third, while the SMRT technology has been shown to be able to detect potential DNA modifications, validating these modifications remains a challenging task. Forth, validation of the presence and function of asRNA and DNA modifications in the chloroplast is even more difficult. Any perturbation experiment has to consider how to specifically target the transcript in the organelle because the molecule has to be delivered to the chloroplast. Furthermore, excluding interference from the nuclear transcripts is difficult because the interaction between those from the nucleus and those from the chloroplast is well known.
In summary, validating the discovery described in this study is technically formidable at present time. Nevertheless, the data presented in this study have demonstrated the complexity of gene expression regulation by asRNA and DNA modification.

Supporting Information
File S1 Contains the files: Figure S1. Characteristics of SMRT sequencing results: (A) length distribution of sequence reads; (B) depth of coverage across the assembled chloroplast genome; and (C) abundance of various lengths of reads mapped to the chloroplast genome. Table S1. Primers used for sequence assembly validation. Table S2. Primer sets used for strandspecific real-time qPCR. Table S3. Statistical testing of RNA-Seq and ss-qPCR results using one-sample t test. Table S4. Detailed information on motif CPGDMM1. Table S5. Detailed information on motif CPGDMM2. Table S6. Associations between DNA modification site (DMS) and downstream ncRNA expression. Table S7. Analysis of the effect of DNA modification at the CPGDMM1 motif on the abundance of ncRNA using ANOVA. (DOC)