Variable Frequency of Plastid RNA Editing among Ferns and Repeated Loss of Uridine-to-Cytidine Editing from Vascular Plants

The distinct distribution and abundance of C-to-U and U-to-C RNA editing among land plants suggest that these two processes originated and evolve independently, but the paucity of information from several key lineages limits our understanding of their evolution. To examine the evolutionary diversity of RNA editing among ferns, we sequenced the plastid transcriptomes from two early diverging species, Ophioglossum californicum and Psilotum nudum. Using a relaxed automated approach to minimize false negatives combined with manual inspection to eliminate false positives, we identified 297 C-to-U and three U-to-C edit sites in the O. californicum plastid transcriptome but only 27 C-to-U and no U-to-C edit sites in the P. nudum plastid transcriptome. A broader comparison of editing content with the leptosporangiate fern Adiantum capillus-veneris and the hornwort Anthoceros formosae uncovered large variance in the abundance of plastid editing, indicating that the frequency and type of RNA editing is highly labile in ferns. Edit sites that increase protein conservation among species are more abundant and more efficiently edited than silent and non-conservative sites, suggesting that selection maintains functionally important editing. The absence of U-to-C editing from P. nudum plastid transcripts and other vascular plants demonstrates that U-to-C editing loss is a recurrent phenomenon in vascular plant evolution.


Introduction
In land plants (Embryophyta), plastid and mitochondrial transcripts undergo a type of posttranscriptional processing called RNA editing, which converts specific cytidines to uridines (C-to-U) or uridines to cytidines (U-to-C) through undefined mechanisms (reviewed in [1][2][3]). Surveys of organellar RNA editing have revealed extensive variability in the frequency and type of editing among and within the major land plant groups, which includes seed plants (Spermatophyta), ferns sensu lato(Monilophyta), lycophytes (Lycopodiophyta), hornworts (Anthocerotophyta), mosses (Bryophyta sensu stricto), and liverworts (Marchantiophyta). Numerous RNA using the RiboMinus Plant Kit for RNA-Seq (Invitrogen) according to the supplied protocol.
The RiboMinus-treated organellar RNAs for both species were sent to the University of Nebraska Genomics Core Facility for 75 bp single-read sequencing on the Illumina GAII platform, generating 26.6 M raw reads for O. californicum and 23.6 M raw reads for P. nudum. Read quality was assessed using FastQC version 0.10.1 (http://www.bioinformatics.babraham.ac.uk/ projects/fastqc/). Adapter and low-quality sequences were trimmed from the O. californicum and P. nudum raw reads using cutadapt version 1.4.1 (https://code.google.com/p/cutadapt/) with modified parameters (-a AGATCGGAAGAGC-q 20-m 35) before further analysis.

An automated pipeline for edit site detection
All remaining data were mapped to the O. californicum and P. nudum plastid genomes using TopHat version 2.0.9 [39] with relaxed parameters (-N 4-read-gap-length 3-read-edit-dist 5 -I 5000-coverage-search). Transcription coverage maps were drawn from the TopHat mapping results in R v3.1.0 with both window-size and step-size of 10 bp, and the number of reads mapped per genomic region were calculated (S1 Table) using the BEDTools multicov command [40]. Mismatches were identified in the O. californicum and P. nudum plastid transcriptomes by comparing the mapped transcript reads to the reference genome sequences. To do so, a pileup mapping summary file was generated with the mpileup command in SAMtools version 0.1.19 [41]. All DNA:RNA mismatches were reported, including mismatches supporting putative C-to-U edit sites (seen as C:T mismatches for sense-strand genes and G:A mismatches for antisense-strand genes), mismatches supporting putative U-to-C edit sites (T:C and A:G), and all other types of mismatches attributable to some type of error rather than RNA editing (A:C, A:T, C:A, C:G, G:C, G:T, T:A, and T:G).
The read depth of the transcriptome mapping and the proportion of reads containing a mismatched nucleotide were calculated at each reference genome position by comparing the resulting pileup file with the reference plastid genome sequences. The pipeline required that all DNA:RNA mismatches have a minimum read depth of 3X and at least 5% of total reads or three individual reads supporting the mismatch, whichever is greater. Initial analysis on untrimmed data using these criteria ("trim0, >5%, min3" as black bars in Fig. 1) identified a large number of C:T and G:A mismatches, which is a clear signal of C-to-U editing in both species.
Additionally, the number of T:C and A:G mismatches was slightly higher than other mismatches, suggesting the possibility of U-to-C editing as well. However, there were also many identified mismatches that could not be caused by C-to-U or U-to-C editing, showing that many false positives were being identified by the automated approach.
Several modifications to the automated approach were evaluated in an attempt to reduce these false positive results (Fig. 1). To potentially eliminate false positives introduced by the higher rate of sequencing error and sequencing bias at the ends of reads (based on FastQC quality analysis), we compared the effect of trimming all reads by 6 bp or 12 bp at both ends (Fig. 1A). Although read trimming generally reduced the number of non-editing mismatches, it did not completely eliminate such sites. Furthermore, the different trimming treatments identified a slightly different set of C:T and G:A mismatches (due to slight differences in coverage at inefficiently edited sites that pushed the mismatch frequency just above or just below the 5% cutoff), raising the possibility that true edit sites could be missed from any single treatment, increasing false negatives. We also attempted to reduce the number of non-editing mismatches by increasing the minimum mismatch frequency threshold (Fig. 1B) or minimum depth-ofcoverage threshold (Fig. 1C). While the number of non-editing mismatches dropped at higher thresholds, they were not eliminated completely. Furthermore, the increased stringency of these thresholds caused C:T and G:A mismatches (consistent to C-to-U editing) to drop much more dramatically, suggesting a substantial increase in false negatives at increased thresholds. These results demonstrated that there were no settings that could effectively reduce the number of false positives without also increasing the number of false negatives.

Manual annotation of automated results to eliminate false positives
To minimize the number of false negatives with the automated approach, we retained all mismatches identified by all three read trimming treatments using the least stringent 5% mismatch frequency and 3x read depth thresholds. We then manually examined all mismatches by visually inspecting mapped reads using the SAMtools tview command. This inspection identified four factors that could provide alternative explanations for some DNA:RNA mismatches: 1) DNA heteroplasmy (i.e., sequence polymorphism among plastid genome copies), 2) errors in the transcriptome reads due to imperfect binding of random hexamers during cDNA preparation, as described previously [42,43], 3) mapping artifacts at exon/intron junctions due to the mismapping of spliced transcript reads onto the unspliced reference genome sequence, and 4) errors in the reference genome sequence.
To eliminate false positives from the automated approach, we used a defined set of criteria to identify DNA:RNA mismatches arising by one of the alternative explanations defined above. Heteroplasmic regions in the genomes were detected by mapping Illumina DNA sequence data from the plastid genome sequencing projects [38] onto the plastid genome sequences using Bowtie version 2.2.  Comparative analyses of RNA editing content, sequence effects, and editing efficiency To compare editing content among species, genes from O. californicum, P. nudum, Anthoceros formosae (AB086179), and Adiantum capillus-veneris (AY178864) were aligned with clustalW [45] and manually adjusted when necessary. Homology of edit sites among species were determined based on the alignments. Using these alignments, the effect of RNA editing on the encoded amino acid were scored as conservative if editing improved amino acid identity with other species or non-conservative if editing did not improve or decreased amino acid identity with other species. Editing efficiency for each edit site was scored as the fraction of reads that contained the edited nucleotide sequence.

An automated approach combined with manual inspection for edit site detection
To detect edit sites in the O. californicum and P. nudum plastid transcriptomes, we generated an automated bioinformatics pipeline that compared transcript reads to the plastid genome sequences that were previously sequenced from the same plants [38]. Although sequence mismatches consistent with RNA editing were most abundantly detected by this automated approach, it was clear that the pipeline also detected many false positive mismatches that could not be generated by RNA editing (Fig. 1), which implies that some of the editing-type mismatches may also be false positives. Modifications to the automated pipeline that altered the trimming of sequence reads (Fig. 1A), the minimal frequency threshold to detect mismatches ( Fig. 1B), or the minimum depth of sequence coverage to examine (Fig. 1C) were insufficient to effectively eliminate all obvious false positives (i.e., the non-editing mismatches) without also introducing numerous unwanted false negatives (apparent by the even larger drop in the number of mismatches consistent with editing). Thus, none of the automated threshold settings could provide a satisfactory tradeoff between the number of false positives and false negatives. To address this issue, we set the automated thresholds to the lowest stringency to minimize the number of false negatives, and then we manually examined all mismatches to examine the potential sources of error and to eliminate false positives.
Manual inspection of results identified several sources of error that generated false positive signals. First, heteroplasmy (i.e., sequence polymorphism among plastid genome copies) occurring in transcribed segments of the genome produces polymorphism among transcripts, which generated a mismatch signal in our automated approach. To eliminate this issue, mismatches occurring at heteroplasmic sites (identified by mapping Illumina DNA reads onto the genome sequences) were removed from the results. Second, there was increased sequencing error at the ends of reads, mostly likely caused by imperfect binding of random hexamers during cDNA library preparation [42,43]. To avoid this problem, we manually excluded all identified mismatches that were solely supported by reads where the mismatch occurred within 6 bp of the end. Third, several mismatches were caused by obvious mismapping of reads at exon/intron junctions and were also eliminated. Finally, there were a few sites where both DNA and RNA read mappings disagreed with the published genome sequence. Because the genomes and transcriptomes were sequenced from the same source plants, this discrepancy indicates an error in the genome sequence. These mismatches were removed from the results, and the genome sequence was corrected to the proper nucleotide. Altogether, our manual corrections eliminated all of the mismatches that could not be caused by C-to-U or U-to-C RNA editing, and also eliminated a small number of editing-type mismatches (S2 Table).

Diverse types and frequency of plastid RNA editing among ferns
After manual inspection of mismatch results generated by the automated pipeline, a total of 297 C-to-U edit sites and three U-to-C edit sites were identified in the O. californicum plastid transcriptome ( Fig. 2; S3 Table). The majority of sites (232/300) affect coding regions, and there is a strong preference for editing at second codon positions relative to first or third positions (Table 1). RNA editing alters the reading frame for eight genes, including the creation of five start codons (accD, ndhA, ndhD, ndhF, psaB) and two stop codons (chIL and petD) by Cto-U editing and the removal of an internal stop codon (ycf1) by U-to-C editing ( Table 1). The rps15 gene also appears to have a start codon created by editing, but the C-to-U change was supported by only a single transcript and was thus not included in our results. Outside of coding regions, a single edit site alters the trnV-GAC gene, which improves base pairing in the stem of the TψC arm (Fig. 3A). In addition, ten edit sites reside in six introns (clpPi71, clpPi363, ndhAi556, petBi6, rpoC1i432, and trnKUUUi37). At least one of these intronic edit sites, in domain V of intron ndhAi556, reconstructs an A:U base pairing that is conserved at the DNA level in related plants and appears essential for proper intron folding (Fig. 3B). The remaining 57 non-coding edit sites were found in intergenic regions, presumably affecting 5 0 or 3 0 UTRs. No edit sites were identified in any ribosomal RNAs.
In the P. nudum plastid transcriptome, a total of 27 C-to-U edit sites and no U-to-C sites were detected ( Fig. 2; S3 Table), including the previously identified site in the ndhB gene [46]. Of these sites, 24 reside in coding regions and preferentially alter the second codon position, but no start or stop codons are created for any gene ( Table 1). The three non-coding edit sites are located within intron clpPi71 and within the psaM-ycf12 and rpl21-rpl32 intergenic regions. No transfer RNAs or ribosomal RNAs were found to be edited.
A comparison of shared plastid RNA edit sites among the three ferns (Adiantum capillusveneris, O. californicum, and P. nudum) and the hornwort outgroup (Anthoceros formosae) shows a large amount of variation in the frequency and type of RNA editing among species (Fig. 4). Of the >500 different edit sites identified in the coding regions of at least one fern, only 50 are shared among more than one fern or with the hornwort. That most of the edit sites are not shared among species indicates a highly lineage-specific pattern of frequent gain and loss during fern evolution. The presence of plastid U-to-C editing in two of the three examined ferns and the hornwort outgroup is most parsimoniously explained by the presence of this process in the common ancestor of all ferns followed by loss from the plastid of P. nudum.

RNA editing efficiency correlates with functional effects and identifies potential mis-editing
To examine the relationship between the functional effects of RNA editing on a protein sequence and the efficiency of editing in the plastid, we classified editing events in O. californicum and P. nudum protein-coding genes based on whether they improved sequence conservation to homologous proteins from other species (conservative sites), decreased conservation to homologous proteins (non-conservative sites), or did not alter the encoded amino acid (silent sites), and then we calculated editing efficiency as the proportion of reads that contained the edited nucleotide in the plastid transcriptome (Fig. 5A). Overall, edit sites that improve sequence conservation are substantially more abundant and more efficiently edited on average than non-conservative and silent editing events. The rarity and generally low editing efficiency of the non-conservative sites in the O. californicum plastid transcriptome suggest that they result from mis-editing (Fig. 5B). For example, editing at position 1109 in the atpB gene changes a conserved serine to a leucine at only 5.1%  efficiency (6/118 transcripts), while an edit site at position 74 in the psaI gene changes a conserved serine to a phenylalanine at 15% efficiency (41/267 transcripts). Both events substitute a small, hydrophilic amino acid with a large, hydrophobic amino acid, which may have negative effects on protein folding and function. In two more extreme cases, premature stop codons are introduced in the clpP gene at position 136 with 6.7% editing efficiency (5/75 transcripts) and in rpl20 at position 172 with 6.0% editing efficiency (16/267 transcripts). These events, which truncate >75% of the CLPP protein and half of the RPL20 protein, would almost certainly abolish their functional activity.

Discussion
Our analyses have uncovered a large amount of variation in the frequency of C-to-U and U-to-C RNA editing among ferns. In addition, we demonstrated the complete loss of U-to-C editing from the plastid transcriptome of P. nudum. Together, these findings show that the frequency and type of RNA editing is highly labile in ferns. Importantly, the discovered loss of U-to-C editing from the P. nudum plastid transcriptome, along with the independent losses of U-to-C editing from seed plant and Selaginella plastids and potentially from additional fern lineages, reveals a recurrent pattern of loss of this process from vascular plants (Fig. 6). One caveat is that we cannot be absolutely certain that are no U-to-C edit sites in P. nudum. This same argument could also be made for the apparent absence of U-to-C editing from Selaginella and seed plant plastids. However, the overwhelming majority of the P. nudum plastid genome exhibits substantial transcription in our data set (Fig. 2), suggesting that few sites could have been missed due to low coverage. Furthermore, our analysis, which searched for very inefficiently edited sites (down to only 5% efficiency), is by far the most sensitive approach that has yet been performed for plastid RNA editing. Thus, while it is formally possible that a U-to-C edit site escaped detection because it is very inefficiently edited (<5% efficiency) or it is present in one of the very few lowly expressed (<3X depth of coverage) regions of the genome, we consider this to be unlikely.
The absence of any internal stop codons in many early diverging lineages suggests additional losses of U-to-C editing in ferns (Fig. 6). Because U-to-C editing often acts to eliminate internal stop codons in transcripts of essential genes, it is possible to predict the activity and relative abundance of U-to-C RNA editing in a species based on the presence and abundance of internal stop codons in otherwise intact and presumably functional genes, although it should be noted that the absence of internal stop codons does not guarantee that U-to-C editing is absent from the organelle. The available plastid genomes of most leptosporangiate ferns contain several internal stop codons in their genes [31], and the two sequenced plastid transcriptomes from A. capillus-veneris and Pteridium aquilinum confirm that U-to-C editing is abundant in these leptosporangiate ferns [29,30]. In contrast, there are no internal stop codons in most early diverging fern lineages, including Psilotales, Equisetales, Marattiales, or in the earliest diverging lineage of leptosporangiate ferns, Osmundales [31], suggesting that U-to-C editing may be Figure 6. Distribution of U-to-C RNA editing among land plants. A plus sign indicates presence, a minus sign indicates absence. A question mark indicates that the presence or absence of U-to-C editing was inferred based on the presence or absence of internal stop codons in essential plastid genes. Inferred gain (black bar) and losses (white bars) of U-to-C editing were mapped using parsimony onto a phylogeny depicting the currently accepted relationships among land plants [35][36][37]. doi:10.1371/journal.pone.0117075.g006 Variable Frequency of Plastid RNA Editing in Ferns absent from these lineages. Here, we verified that the P. nudum plastid transcriptome lacks U-to-C editing, consistent with the absence of internal stop codons in its genome. More RNA editing analyses are needed, particularly from Equisetales, Marattiales, and Osmundales, to better understand the dynamic evolutionary history of C-to-U and U-to-C editing among ferns and to determine whether U-to-C editing was lost multiple times during fern evolution.
Our results also show a clear distinction in the frequency and efficiency of editing at sites that increase conservation among homologous proteins compared with sites that are silent or decrease sequence conservation, which is consistent with many previous observations in other vascular plant organelles (e.g., [4-8, 19, 25, 27-29, 47, 48-51]), indicating that this a general pattern of RNA editing across vascular plants. That conservative edit sites are more abundant and more efficiently edited strongly suggests that these sites are necessary for optimal protein function, and, furthermore, that selection is acting to maintain editing at these sites. Conversely, because silent editing events do not alter the protein sequence, there appears to be little to no selection at the protein level to retain these sites or to maintain the editing factors that control their editing efficiency. The extreme rarity of non-conservative editing suggests that selection is acting to eliminate many of these sites and reduce the editing efficiency of those that remain, thus mitigating their presumably negative consequences. The few non-conservative editing events that remain may occur at sites that are less functionally important in the protein, in which case the editing effects are selectively neutral and behave like silent editing events. Because we extracted RNA from all above-ground tissues, including stems and leaves in O. californicum and stems and enations in P. nudum, the low editing efficiency at some sites may reflect tissue-specific differences in editing efficiency, as shown in some plant organelles [12,13,52,53]. Alternatively, it is possible that some of these silent and non-conservative events have a functional role, perhaps to regulate transcript stability, intron splicing efficiency, or the efficiency of editing at other sites in the transcript [47,54,55].
Edit sites in introns and intergenic regions are likewise rare and inefficiently edited on average, suggesting that most of these sites have no major function and are thus under little to no selective constraint to be maintained, similar to findings in other organellar transcriptomes [51,[56][57][58]. However, intronic editing in O. californicum (Fig. 3B) and other species [5,19,25,[59][60][61][62] appears in some cases to improve intron secondary structure, which is important to increase intron splicing efficiency. Intergenic edit sites in plant organellar transcripts may also have functional importance in some cases, such as regulating translational efficiency [7,19,63,64].
Finally, our examination of false positives and false negatives revealed a wide variety of issues that can arise when using RNA-seq data to detect edit sites in a plant transcriptome. Overly stringent cutoff values introduced many false negatives by eliminating low-coverage and/or inefficiently edited sites from the automated results, while biological and technical issues such as heteroplasmy, imperfect binding of random hexamers during cDNA synthesis, mismapping at exon/intron junctions, and simple errors in the genome sequence generated a large number of false positive signals of editing that had to be eliminated. To minimize the number of false negatives, we used relaxed search parameters that included sites with low depth of coverage (down to only 3x read depth) and inefficient editing (down to only 5% efficiency), although we acknowledge that sites edited at less than 5% efficiency or covered by fewer than three transcript reads would have been missed by our approach. Because these relaxed parameter values increased the number of false positives, we manually evaluated all detected mismatches to eliminate signals due to biological and technical issues. In fact, our manual examination of results eliminated all of the mismatches that could not be caused by C-to-U or U-to-C RNA editing, verifying that our manual approach is able to efficiently detect and eliminate false positives. In summary, our work demonstrates that automated detection approaches alone are not sufficient for accurate editing detection. It is imperative to manually curate the automated results to ensure that false edit sites are not reported and true edit sites are not missed due to the issues described above. Additional suggestions, such as sequencing from organelle-enriched RNA, using DNase I to ensure no DNA contamination, priming with random hexamers rather than oligo-dT primers, subtracting rRNA from the RNA preparation, and employing sequence aligners (such as TopHat) that can handle mismatches and splicing junctions, were recently proposed to ensure optimal analysis of plant mitochondrial transcriptomes [65]. Many of these considerations are equally applicable to the sequencing of plant plastid transcriptomes and were implemented in our study.