Transcriptional Slippage and RNA Editing Increase the Diversity of Transcripts in Chloroplasts: Insight from Deep Sequencing of Vigna radiata Genome and Transcriptome

We performed deep sequencing of the nuclear and organellar genomes of three mungbean genotypes: Vigna radiata ssp. sublobata TC1966, V. radiata var. radiata NM92 and the recombinant inbred line RIL59 derived from a cross between TC1966 and NM92. Moreover, we performed deep sequencing of the RIL59 transcriptome to investigate transcript variability. The mungbean chloroplast genome has a quadripartite structure including a pair of inverted repeats separated by two single copy regions. A total of 213 simple sequence repeats were identified in the chloroplast genomes of NM92 and RIL59; 78 single nucleotide variants and nine indels were discovered in comparing the chloroplast genomes of TC1966 and NM92. Analysis of the mungbean chloroplast transcriptome revealed mRNAs that were affected by transcriptional slippage and RNA editing. Transcriptional slippage frequency was positively correlated with the length of simple sequence repeats of the mungbean chloroplast genome (R2=0.9911). In total, 41 C-to-U editing sites were found in 23 chloroplast genes and in one intergenic spacer. No editing site that swapped U to C was found. A combination of bioinformatics and experimental methods revealed that the plastid-encoded RNA polymerase-transcribed genes psbF and ndhA are affected by transcriptional slippage in mungbean and in main lineages of land plants, including three dicots (Glycine max, Brassica rapa, and Nicotiana tabacum), two monocots (Oryza sativa and Zea mays), two gymnosperms (Pinus taeda and Ginkgo biloba) and one moss (Physcomitrella patens). Transcript analysis of the rps2 gene showed that transcriptional slippage could affect transcripts at single sequence repeat regions with poly-A runs. It showed that transcriptional slippage together with incomplete RNA editing may cause sequence diversity of transcripts in chloroplasts of land plants.

In bacteria such as Escherichia coli, transcriptional slippage (TS) occurs at simple sequence repeat (SSR) regions, especially at poly-A or poly-T runs [10,11].TS is mediated by the bacterial-type RNA polymerase, which leads to the incorporation of more or less nucleotides into the transcript, thereby producing a population of heterogeneous mRNA sequences.The heterogeneous transcripts include RNA sequences that are inconsistent with their DNA templates.If TS occurs in open reading frames, RNA transcripts of a gene are expected to encode different protein products.Both plastid-encoded RNA polymerase (PEP) and nuclear-encoded RNA polymerases (NEP) are active in CPs [12].Unlike the phage-type NEP, PEP has a core enzyme like E. coli RNA polymerase [13,14], so TS is likely to occur in CPs.TS has been reported to occur in endosymbiontic bacteria such as Buchnera aphidicola and Blochmannia pennsylvanicus [15], but has not yet been found in CPs.Recently, several RNA-seq-based papers showed the genome-wide view of transcript variability in Arabidopsis thaliana [16][17][18] and ferns [19], but these studies focused on only RNA editing (RE), not TS.
RE is a post-transcriptional modification in CPs that can change C to U residues or vice versa at specific sites of RNAs in both coding and noncoding regions, producing transcripts that are inconsistent with their DNA templates [20,21].RE in coding regions can alter the amino acid sequence of proteins.For example, editing of the second position of the ACG codon at 5' end of transcripts will create an AUG initiation codon [22,23], and editing of the first position of CAA, CAG and CGA will create stop codons [23].RE is common in CPs and editing patterns have been studied in crops such as maize [24,25], sugarcane [26], rice [27], pea [28], tomato [26], cotton [29] and black pine [30].
Mungbean (Vigna radiata L.), an increasingly important legume crop, is currently grown on 6 million hectares, mainly in Asia, but also in Australia and Canada.Genomics studies of this crop lag behind that of many other plants, although because of its short generation time and relative small genome size, mungbean could be a model legume plant for genomic analyses.In the present study, we used mungbean as a model plant to investigate TS and RE in CP genes.We compared the CP genomes of three mungbean genotypes based on the deep sequencing results and analysed low frequency sequence variations in the transcriptome of mungbean RIL59.RT-PCR, cloning and Sanger sequencing were used to validate the findings.The results were compared to CP transcriptome variations in the major lineages of land plants.

Plant material and genomic DNA purification
The seeds of Vigna radiata var.sublobata TC1966, V. radiata var.radiata NM92 and RIL59, an F 11 recombinant inbred line derived from a cross of TC1966 x NM92, were obtained from AVRDC-The World Vegetable Center, Taiwan.The seed was germinated and planted in the greenhouse of the Institute of Plant Molecular Biology, Academia Sinica, Taiwan.Genomic DNA (gDNA) of TC1966, NM92 and RIL59 was extracted from 0.5 g of young leaves harvested from seedlings with use of the Plant Genomic DNA Extraction Minprep kit (Viogene-Biotek Corp.New Taipei City, Taiwan) following the manufacturer's instructions.The integrity and purity of the extracted gDNAs were confirmed by gel electrophoresis and spectroscopy verifying that the ratio 260 to 280 nm is > 1.8 and the 260 to 230 nm is > 2.0.

Whole genome sequencing and read assembly
The qualified gDNAs of mungbean lines were sequenced on an Illumina HiSeq 2000 sequencer (Genomics BioSci & Tech Co., New Taipei City, Taiwan).Paired-end reads (2 x 100 bp) trimmed with an error probability < 0.05 were collected for genome assembly.The CP genome was assembled by use of MIRA 4 [31] and MITObim 1.7 [32].V. radiata var.radiata KPS1 CP genome (GenBank accession no.GQ893027 [33]) was used as a reference.The workflow of MITObim was described in detail in [33].The CP draft sequences generated from MITObim were re-mapped by using bowtie2 2.2.1 [34] to check the accuracy of assembly.Re-mapping alignments were visualized with use of Integrative Genomics Viewer 2.3.32 [35,36], then manually inspected.Sites and regions with a depth lower than 1/100 of the average coverage or ambiguous sequences were considered as gaps and were verified by PCR and Sanger sequencing.The boundaries of inverted repeats (IRs) and single copy regions (SCs) were verified by PCR and Sanger sequencing.Calculation of the genome coverage and conversion of alignment formats were carried out by using SAMtools 1.1 [37].The CP sequences were deposited at DDBJ (DNA Data Bank of Japan).

Sequence annotation
Protein coding, ribosomal RNA (rRNA), and tRNA genes were annotated by use of DOGMA [38] and the annotation was confirmed by BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi).All tRNA genes were further verified by use of tRNAscan-SE 1.21 [39].The genome map was drawn with use of Circos 0.67 [40].CP SSRs with length more than 8 bp were identified by use of MISA [41] with default parameters.The CP genomes of different mungbean lines were aligned by use of bl2seq included in NCBI BLAST (http://blast.ncbi.nlm.nih.gov/Blast.cgi).

RNA extraction and RNA-seq
Total RNA was extracted from a whole plant of 1-month-old RIL59 seedling with a modified CTAB-based method [42].The gDNA contaminants were removed from extracted RNA by use of Ambion TURBO DNA-free Kit (Life Technologies Co.).DNA-removed RNA samples were analysed on an RNA LabChip with Agilent 2100 Bioanalyzer.Samples with >7.0 RNA integrity, >1.4 rRNA ratio, and no DNA contamination were selected for experiments.The qualified RNA underwent ribosomal RNA depletion with TruSeq Stranded Total RNA with Ribo-Zero Plant Kit (Illumina, San Diego, CA, USA).The cDNA libraries were generated by RNA fragmentation, random hexamer-primed cDNA synthesis, linker ligation and PCR amplification.RNA-seq was carried out on an Illumina HiSeq 2000 system (Genomics BioSci & Tech Co., New Taipei City, Taiwan).Three criteria were used to filter dirty raw reads: remove reads with sequence adaptors; remove reads with > 2% 'N' bases; and remove low-quality reads, which have more than half of bases with QA 15.The RNA-seq data of RIL59 was uploaded to the NCBI Sequence Read Archive (accession no.SRR1867748).

Identification of transcript diversity
The clean reads were mapped to the sequence of the RIL59 CP genome with use of TopHat2 [43]/bowtie2 [34].The parameters for TopHat2 used in mapping were set to "-N 2-read-gap-length 2-max-insertion-length 3-max-deletion-length 3".The RNA-seq alignment was visualized by use of Integrative Genomics Viewer 2.3.32 [35,36], then manually inspected.The RE was inferred from C-to-U substitutions with a cut-off value of at least 5% edited reads.Indelcontaining reads in every SSR locus were assumed to be results from TS.The editing ratio and TS frequency were inferred from the number of substitution and indel reads to total aligned reads at RE and TS sites, respectively.

Verification of TS and RE sites
First strand cDNA was synthesized with use of SuperScript III Reverse Transcriptase (Invitrogen) following the manufacturer's protocol with gene-specific primers (S1 Table ).PCR and RT-PCR were performed with use of a high fidelity Taq polymerase Pfu (Invitrogen).Amplicons of RT-PCR were directly sequenced by use of ABI PRISM 3730xl.To further verify the TS and RE ratio, RT-PCR amplicons were cloned, then sequenced using ABI PRISM 3730xl.The TS and RE ratio was inferred from the ratio of different sequences of clones.

Sequencing and general features of the mungbean CP genome
The whole genome shotgun sequencing (WGS) of three mungbean lines, NM92, TC1966 and RIL59, generated 34.17, 37.96 and 154.48 million clean paired-end reads, respectively.Through baiting and iterative mapping, CP sequence reads were extracted from the total read pools and assembled into whole CP genomes.The achieved read depth of the CP genomes of TC1966, NM92, and RIL59 were on average 5,168-, 6,409-and 7,914-fold (S2 Table ), respectively.We compared the CP genomes of wild mungbean TC1966 (DDBJ accession no.AP014692), mungbean line NM92 (DDBJ accession no.AP014691), RIL59 (this study) and line KPS1 [33].The length of the TC1966 CP genome was 151,283 bp, which is 12 bp longer than the CP genome of NM92 and RIL59 (Fig 1).Both NM92 and RIL59 CP genomes have identical sequences, of 151,281 bp, and are 10 bp larger than the CP genome sequence of KPS1.In total, 213 SSRs, including one dinucleotide repeat and 212 single nucleotide repeats, were identified in the CP genomes (S3 Table ).Overall, 108 of the single nucleotide repeats are 8-mers, 66 are 9-mers, 17 are 10-mers, 12 are 11-mers, and 5 are 12-mers.SSRs from 13-mer to 16-mer represent one single nucleotide repeat each.

SNVs, indels, and possibility of paternal leakage
The four mungbean CP genomes were aligned for detailed sequence comparison.Between TC1966 and NM92 (RIL59) CP genomes, 78 single nucleotide variations (SNVs) and 9 indels contribute a 78-bp and 10-bp difference (S4 and S5 Tables), respectively.In total, 88 bp CP genome map and comparisons of Vigna radiata var.sublobata TC1966.Four concentric circles from the outside to inside represent the TC1966 CP genome map, the variances between TC1966 and NM92 (RIL59), the variances between TC1966 and KPS1, and the length scale, respectively.The histograms on the outer and the inner circles of the TC1966 CP genome map denote the clockwise and the anticlockwise transcription of the genes, respectively.doi:10.1371/journal.pone.0129396.g001distinguish the CP genomes of TC1966 from NM92 (RIL59).Of the 78 SNVs, 13 are transitions and 65 are transversions.Of the nine indels, eight are 1-bp and one is 2-bp.As expected, the CP genome of the cultivated variety KPS1 is more distant from wild mungbean TC1966 than the other two cultivated mungbean genotypes NM92 and RIL59 (S4 and S5 Tables).The KPS1 CP genome contains all variants that distinguish NM92 (RIL59) from TC1966 and has 32 additional 1-bp indels as compared with TC1966 (Fig 1).Uniparental transmission of the CP genome via the female gametophyte is the most common mode of inheritance in angiosperms [44], but a low-frequency leakage of plastids via pollen seems to be universal in plants.Paternal leakage was assessed for the cross NM92 x TC1966 by in-depth comparison of the CP genome sequence polymorphisms between the parents and RIL59.The RIL59 CP genome was identical to its maternal parent NM92.No heteroplasmy pattern was observed in RIL59 after assessing all SNVs and indels documented for comparing of the CP genomes of TC1966 and NM92 at a sequencing depth from 7,762-to 7,998-fold.The absence of TC1966-specific variants suggested that the possibility of paternal leakage is low or even absent in RIL59.
Transcriptional slippage in the mungbean CP genome is positively correlated with SSR length TS takes place in bacteria at SSR tracts and alters the transcript size at SSRs.To study the TS pattern in the mungbean CP genome, we assessed transcript diversity at all SSRs present in the CP genome.For 213 CP SSRs, variation in SSR sequence length was observed in 0 to 73.86% of all transcripts from SSR-containing loci (S6 Table ).The 13-, 14-, 15-and 16-bp-long SSRs presented in single locus.Excluding the single-locus SSRs that lack replicates in statistical analyses, the mean TS frequency was positively correlated with the length of the corresponding SSR motif (R 2 = 0.9911, Fig 2).Although the R 2 is close to 1, the standard errors are relative higher for longer SSRs, indicating TS frequency tends to be highly variable in longer SSRs.This finding suggested that several polymerases with different fidelity are subject to errors at longer SSR transcripts.The polymerases involved in this phenomenon are suggested to be PEP, NEP, and even the polymerases used in the RNA-seq experiments [45,46].This suggestion is reasonable because genes in CP are transcribed by PEP, NEP only or both [12].
Moreover, Taq polymerases and reverse transcriptases are possibly subject to errors at SSRs [45,46].In order to minimize the risk that the indels detected in the transcriptome sequences were experimental artefacts, high fidelity polymerases such as Pfu and SuperScript III Reverse Transcriptase were used in the verification experiments.Nevertheless it cannot be excluded that indels introduced by the polymerase and reverse transcriptase contributed to the observed transcript variability.

Identification of RE sites
RE sites were identified as mismatches between RNA-seq reads and the CP genome sequence.In total, 41 C-to-U editing sites were detected in transcripts of 23 CP genes (Table 1), with no U-to-C editing identified.The editing sites were almost exclusively located in protein-coding regions, only one site was found in the intergenic region between rpl20 and 5'rps12 genes.Five editing sites altered the first nucleotide of a codon, 34 sites the second, and one site the third nucleotide of a codon.Altogether, 39 nonsynonymous and one synonymous substitutions were generated in the RIL59 CP genome by RE.The editing level was inferred from the C versus U ratio of the transcripts derived from the respective loci (Table 1).Three sites located in atpA, ndhD and the intergenic region between rpl20 and 5'rps12 were edited at the lowest observed frequency (< 33.3%).Five sites located in psbN, rpoA, petL, rpoB and rpoC1 showed moderate editing levels (33.3-66.7%)and the remaining sites were edited at high levels (> 66.7%).

U-indels overlapping with RE sites appear in transcripts of psbF
RNA-seq data suggested that 12.03% of psbF transcripts had a U-insertion or U-deletion near the editing site (S1 Fig) .The observation was verified by RT-PCR and Sanger sequencing.The RT-PCR products showed sequencing "noise" at the editing sites and the end of the SSR (Fig 3).For further confirmation of the U-indels, the amplicons were cloned and 62 independent colonies were randomly chosen for Sanger sequencing.From these clones, 47 (75.81%) showed C to U editing, 5 (8.06%) had unedited sequences, and U-insertion and U-deletion sequences were found in 4 (6.45%) and 5 (8.06%) clones, respectively.One of the clones (1.61%) had an unedited C at the RE position but a U deletion nearby (Fig 3

and S2 Fig).
The similar percentage of indel revealed in the transcripts in both RNA-seq data and experimental verification with use of high fidelity polymerases suggests that TS indeed subjects indels at the locus rather than an artefact from Taq polymerase.

Origin of the U-indel containing transcripts
Horizontal gene transfer from CPs to the nucleus or to mitochondria has been documented for plants [47].Nuclear plastid like (NUPT) and mitochondrial plastid like (MTPT) sequences generally accumulated substitutions and indels during their evolution.To test whether the Uindel psbF transcripts were derived from NUPTs or MTPTs and therefore not from the CP genome, we used bowtie2 with default parameters for mapping WGS and RNA-seq reads to the CP genome.The WGS reads, comprising 15.45 Gbp, covered the 579 Mbp mungbean genome [48] approximately 26.7-fold.For WGS read mapping, if reads from NUPTs and MTPTs were mapped to the CP genome, their depths should be approximately equal to or many times greater than 26.7-fold, depending on the copy numbers of NUPTs and MTPTs in mungbean.All WGS reads corresponding to psbF mapped to the CP genome without any U-indel (S3 Fig) .In contrast, RNA-seq reads mapped by the same procedure showed C-to-U substitution and Uindel patterns (S3 Fig), which strongly indicates that the U-indels are derived from transcriptional processes at the CP genome level, not from NUPTs or MTPTs, which agrees with the psbF variants derived from the CP genome.

Relation of U-indels in psbF transcripts to RE
C-to-U and U-to-C substitutions by RE are known processes in plant organelles [24,25], whereas U-indels have been detected in mitochondrial transcripts of Trypanosomatids and Myxomycota [49].U-indels have not yet been analyzed in-depth for plants.To test whether Uindels in psbF transcripts are related to RE, we searched for evidence for these features in publicly available RNA-seq data for land plants.Transcriptome sequence data for four dicot species (Rosid: Glycine max, Arabidopsis thaliana, Brassica rapa; and Asterid: Nicotiana tabacum), two monocot species (Oryza sativa and Zea mays), two gymnosperm species (Pinus taeda and Ginkgo biloba) and a moss (Physcomitrella patens) available from the Sequence Read Archive (SRA) at NCBI were analyzed for presence of U-indels and RE in psbF transcripts (S7 Table ).All tested Rosid species and G. biloba but not the remaining species shared the same editing site in psbF [50].The SRA data for these nine species were mapped to the psbF sequences of the corresponding CP genomes, and the C-to-U editing pattern in these nine species was To further verify this finding, we randomly cloned and sequenced 80 of psbF RT-PCR amplicons from N. tabacum, which lacks the RE site in psbF, and found that two clones had an U-insertion and one clone an U-deletion (S5 Fig) .Additionally, the psbF amplicons derived from P. patens, the most distant species of the chosen land plant set, were selected for cloning to ensure U-indels.From 155 sequenced psbF clones for this species, two clones (1.3%) showed U-indels at the poly-U SSR of the gene (S6 Fig) .From the presence of U-indels in psbF transcripts for the two species that lack the RE site in this gene we concluded that C-to-U RE and U-indel generation in this gene are independent of each other.The sequence surrounding the RE site in psbF contains nine continual T residues in mungbean, suggesting that U-indels in psbF transcripts are generated by TS.

Sequence variation in psbF transcripts might lead to at least five different proteins sharing the same N-terminal
The psbF gene is part of the psbE operon, which contains psbE, psbF, psbL and psbJ genes.The psbF transcripts encode a 39 amino acid protein, with an edited phenylalanine or an unedited serine codon in the 77th nucleotide.U-indels in psbF transcripts lead to frame shifts, resulting in aberrant termination of translation.U-insertion at this site in psbF transcripts would result in a longer protein variant with 42 instead of 39 amino acid residues.The first 27 N-terminal amino acid residues of this variant would correspond to those of the common psbF protein, whereas the 12 following amino acid residues would be altered by the frame shift.The common psbF stop codon would be bypassed, thus resulting in a psbF variant with three additional amino acid residues.U-deletion transcripts would also bypass the common psbF stop codon and extend the psbF open reading frame to the stop codon of psbL, thus resulting in a psbF-psbL fusion protein with 85 amino acids in length.Because RE and TS are independent mechanisms, six different psbF transcripts are expected to be generated, but only five were identified by our experiments (Fig 3).Further research at the protein level is needed to test whether every different transcript can be translated.Also, if transcripts are translated, the functions of the psbF protein variants in vivo would be of interest.

TS patterns in ndhA and rps2 genes
The psbE operon is conserved in land plants [51] and is mainly transcribed by PEP, an E. colilike RNA polymerase.Similar to the psbF transcripts, TS may occur in other PEP transcribed genes.Like psbF, the ndhA is also mainly transcribed by PEP.The ndhA gene in land plants is conserved, except Pinaceae lack all ndh genes [52].In addition, the ndhA has high TS frequency (22.37%) at a poly-U SSR with 9 bp in size (S6 Table ).Therefore, we assessed the TS patterns of the ndhA gene in the main lineages of land plants using both bioinformatic and experimental methods.
Although the SSR in ndhA is relatively short, 2.24% reads still revealed indels at WGS mapping result (S7 Fig) .We searched the ndhA sequence against whole nuclear [53] and mitochondrial genomes [54] of mungbean by using blastn.Results showed that ndhA-like sequences Interestingly, a poly-A SSR in the ndhA intron of A. thaliana had a 6.3% of A-indel reads at that SSR.The rps2 gene was selected to ensure whether TS could occur at poly-A type SSRs.The SSR region in rps2 gene is conserved in Fabaceae but not in other land plants.Similar to the ndhA gene, 4.78% reads showed indels from mungbean WGS mapping result of rps2 gene (S10 Fig) .Also, the results from blastn and amplicon sequencing supported that the rps2 gene only present in CP genome and the indel-read possibly generated from a non-proofreading Taq.
At the results of RT-PCR amplicon sequencing, as expected, the "noise" peaks are revealed only in Fabaceae species, V. radiata and G. max (S11 Fig) .Altogether, TS is associated with the SSR, especially in genes transcribed by PEP.Therefore, indels at SSR sites caused by TS are a common feature in land plants.

Conclusions
The study demonstrated that deep RNA-seq is a powerful tool to obtain a detailed view of transcript variability and can detect sequence variations caused by TS and RE.TS is a common phenomenon in CP transcripts of plants.In mungbean, TS frequency in CP transcripts is highly positively associated with the CP SSR length.TS in CP transcription resembles TS in bacteria [10,11] and TS in CP may be a relict from the bacteria origin of these organelles.Our results demonstrated that TS and RE caused transcript diversity in the CP genome.The implication of the diversity on translation and protein function remains to be elucidated.

Fig 1 .
Fig 1.CP genome map and comparisons of Vigna radiata var.sublobata TC1966.Four concentric circles from the outside to inside represent the TC1966 CP genome map, the variances between TC1966 and NM92 (RIL59), the variances between TC1966 and KPS1, and the length scale, respectively.The histograms on the outer and the inner circles of the TC1966 CP genome map denote the clockwise and the anticlockwise transcription of the genes, respectively.

Fig 3 .
Fig 3. RNA transcript variation of psbF gene.(A) CP genome sequence detected by direct sequencing of PCR amplicons.(B) The sequencing "noise" at the U-indel and RNA editing (RE) sites of the transcripts detected by direct sequencing of RT-PCR fragments.Five psbF sequence variants were obtained after sequence analysis of cloned RT-PCR amplicons.One predicted sequence, which could not be detected by sequencing of clones, is marked with a star.doi:10.1371/journal.pone.0129396.g003

Table 1 .
RNA editing pattern in the CP genome of mungbean accession RIL59.