Genome-Wide Characterization of RNA Editing in Chicken Embryos Reveals Common Features among Vertebrates

RNA editing results in a post-transcriptional nucleotide change in the RNA sequence that creates an alternative nucleotide not present in the DNA sequence. This leads to a diversification of transcription products with potential functional consequences. Two nucleotide substitutions are mainly described in animals, from adenosine to inosine (A-to-I) and from cytidine to uridine (C-to-U). This phenomenon is described in more details in mammals, notably since the availability of next generation sequencing technologies allowing whole genome screening of RNA-DNA differences. The number of studies recording RNA editing in other vertebrates like chicken is still limited. We chose to use high throughput sequencing technologies to search for RNA editing in chicken, and to extend the knowledge of its conservation among vertebrates. We performed sequencing of RNA and DNA from 8 embryos. Being aware of common pitfalls inherent to sequence analyses that lead to false positive discovery, we stringently filtered our datasets and found fewer than 40 reliable candidates. Conservation of particular sites of RNA editing was attested by the presence of 3 edited sites previously detected in mammals. We then characterized editing levels for selected candidates in several tissues and at different time points, from 4.5 days of embryonic development to adults, and observed a clear tissue-specificity and a gradual increase of editing level with time. By characterizing the RNA editing landscape in chicken, our results highlight the extent of evolutionary conservation of this phenomenon within vertebrates, attest to its tissue and stage specificity and provide support of the absence of non A-to-I events from the chicken transcriptome.


Introduction
Directives 98/58/EC and 86/609/EEC. Animals were maintained under standard breeding conditions, being subjected to minimal disturbance. The farm is registered by the Ministry of Agriculture with the license number C37-175-1 for animal experimentation. The experiment was realised under authorization 37-002 delivered to D. Gourichon. Embryos were harvested for another study [33] after incubation in standard conditions.

Tissues dataset
The material used in this study for the embryo sequences dataset was previously described [33] [SRA study accession number: SRP033603]. Briefly, two chicken lines were crossed, Line 6 [34] and Line R - [35]. Twelve F1 were produced from 2 families: 8 embryos (embryonic day 4.5) and 4 adults from the same batch. Embryos were kept as a whole, while 3 adult tissues were harvested: brain, heart, and liver, after electronarcosis and immediate bleeding. Additional embryos were produced at embryonic days 4.5 (n = 8) and 15 (n = 8), from a cross between the same lines, and 3 embryonic tissues were harvested: brain, heart and liver. Genomic DNA and total RNA were concurrently extracted from the same samples of whole embryos or individual tissues (AllPrep DNA/RNA Mini Kit, Qiagen). RNA quality was measured by a BioAnalyzer (Agilent); all samples had a RIN (RNA Integrity Number) 9.9.

Sequencing
RNA sequencing. Libraries with a mean insert size of 200bp were prepared following Illumina instructions for RNA-Seq analysis, by selecting polyA+ fragments (TruSeq RNA Sample Prep Kit) from each sample. Samples were tagged to allow subsequent identification, amplified by PCR and quantified by qPCR (Agilent QPCR Library Quantification Kit).
A total of 8 embryo libraries were sequenced (paired-ends, 100 bp) in triplicate after multiplexing, on an Illumina HiSeq 2000 sequencer (Illumina, TruSeq PE Cluster Kit v3, cBot and TruSeq SBS Kit v3) by randomizing their position in 6 different sequencing lanes.
DNA sequencing. DNA from 8 embryos was sequenced after multiplexing on 5 lanes of Illumina HiSeq 2000. Library preparation (mean insert size 328bp), DNA quantification and sequencing (paired-ends, 100bp) were performed according to the manufacturer's instructions (TruSeq DNA Sample Prep Kit Illumina, Agilent QPCR Library Quantification Kit, TruSeq PE Cluster Kit v3 cBot TruSeq SBS Kit v3).

Computational analyses
When not specified, analyses were performed with in-house Perl and R scripts.
All studied samples were aligned independently with regard to their sequencing lane, and then merged by individual before identification of RNA/DNA differences.
Genomic sequences analyses. Sequences were aligned to the current chicken genome assembly (Gallus gallus 4) using the BWA program version 0.7.0, option aln [36]. Sequences were then filtered on mapping quality (MAPQ30). SAMtools rmdup command was used to remove possible PCR duplicates.
PolyA RNA sequences analysis. Sequences were aligned with Tophat software version 2.0.5 on the chicken reference genome Galgal4 as described in [33].
Sequences mapping uniquely on the reference genome, without PCR duplicates and with a minimum mapping quality of 30, were selected.
Identification of RNA/DNA differences. Sequences were locally realigned and recalibrated before SNP detection with GATK software version 1.6.11 and BamUtil (bam recab command). SAMtools software version 0.1.19 was used with mpileup utility to detect SNPs between DNA and RNA samples from each individual. We set a maximum coverage of 10,000 for each calling to take into account as many reads as possible in the calling. SNPs were detected independently on each biological replicate.
Editing detection. SNPs were analyzed from VCF files obtained from SAMtools mpileup detection. For each biological replicate, only variations where DNA was homozygous either for the reference allele or for the alternative allele, and where RNA was heterozygous, were kept.
Several successive filters were applied to consider a position as a putative RDD site. We first only considered positions with a sufficient depth, keeping only candidates presenting a minimum of 15 reads both in DNA and RNA alignments. To increase the likelihood of a site to be a true RDD position by avoiding a sample artefact, we set to 2 the number of biological replicates that must carry the same modification.
We then applied several filters inherent to technical bias due to highthroughput sequencing.
As there is an over-representation of mis-called SNP in read extremities [37][38][39], we discarded each RDD site for which the median position of the "edited" allele among reads overlapping them was in the 10 first or 10 last bases. In accordance with previous studies [28], we chose to consider only the distribution of the "edited" nucleotide position, to increase the stringency of the method.
The strand bias was also considered; to be kept, a RDD candidate must present a proportion of edited allele on the forward strand close to its proportion on reverse strand (delta0.5).
We checked the biallelic status of each selected candidate, a third allele being detected in less than 5% of cases was considered as a sequencing error.
An additional filter was applied to ensure that the alternative nucleotide frequency in DNA was null.
The functional consequence of each RDD in each transcript was predicted using the Ensembl Variant Effect Predictor (VEP) version 71 [40]. Non-coding splicing site regions were removed to take into account putative misaligned reads at these sites [31]. Then, positions belonging to homopolymers (n5) were removed because they may generate false positive candidates [13].
The chicken genome assembly still lacks several assembled regions, due to sequence assembly errors or missing fragments. A fragment detected as uniquely mapped may thus be present, with several polymorphisms, at genomic regions absent from the reference sequence, but present in the DNA reads from our samples. Therefore, a last filter was performed by searching the "editing site" (40 bp surrounding the candidate locus) in the DNA reads from samples thought to be edited. This pattern was searched with fuzznuc [41].

Validation assays and editing characterization
Sanger sequencing. We first checked the homozygous status of RDD sites by Sanger sequencing of DNA. The 8 biological replicates were tested. Primers were designed using Pyro-Mark Assay Design software to allow further cDNA pyrosequencing (S1 Table).
Pyrosequencing. RDD sites were tested on a Qiagen PyroMark Q24 sequencer. Primers were designed with PyroMark Assay Design software (S1 Table). PCR products were made using PyroMark PCR Kit (Qiagen). We performed the analyses through the PyroMark Q24 1.0.10 software with default analysis parameters.
Tissue and stage effects on the editing level were tested through an analysis of variance in a model taking into account tissues, stages and the interaction between tissues and stages for each tested candidate.
In Silico prediction of protein structure and function. To predict the putative effect of the editing conversions on protein structure and function, we used several bioinformatic tools: SIFT is based on sequence homology and the physical properties of amino acids (http://sift.bii. a-star.edu.sg/); PolyPhen2 uses physical and comparative considerations (http://genetics.bwh. harvard.edu/pph2/); MutationAssessor is based on evolutionary conservation of the affected amino acid in protein homologs (http://mutationassessor.org/); CHASM (computed using CRAVAT 3.0: http://www.cravat.us/) is based on the probability that a modification gives the cells a selective survival advantage; ProSMS predicts protein stability changes due to single amino acid modifications (http://babel.ucmp.umu.se/prosms/).

Sequences analysis
DNA and RNA sequences were obtained from the same samples of chicken embryos. On average, 141,534,451 DNA reads and 65,302,559 RNA reads were aligned and analyzed for each embryo. The genome coverage reaches 93% for DNA reads and 22% for the RNA reads. A summary of DNA and RNA sequences aligned to the Galgal4 chicken assembly is presented in Table 1.

Data filtering-detection of biases
The first step was to detect RDD sites, i.e. positions homozygous in DNA and presenting an alternative sequence in RNA. To consider a position as a potential candidate, we fixed a minimum read-depth threshold of 15 both in DNA and RNA alignments for each embryo. We kept only candidates for which the alternative nucleotide frequency in DNA was null ( Fig 1A). A total of 1,327 RDD sites met this criterion. The next filtering steps are aiming to avoid common pitfalls in sequences analysis thus increasing the robustness of the results (Fig 1). To avoid putative false positive due to an artifact present only in one sample, we only considered RDD sites detected in at least 2 biological replicates. We ended up with 324 RDD sites ( Fig 1B).
It has previously been shown that polymorphisms overrepresented in read extremities are likely to be false positives [37][38][39]. In order to avoid this bias, we considered only RDD sites that were, in median, not in the 10% extremities of RNA reads overlapping them ( Fig 1C). Two additional filters related to sequencing were applied: we removed candidates with an over-representation of one allele on one strand and discarded positions where more than one alternative nucleotide was found in proportions greater than 5% ( Fig 1C). A total of 112 RDD sites passed these filters. We then removed candidates in splicing sites from non-coding regions ( Fig  1D), and filtered for regions containing homopolymers ( Fig 1E). A total of 84 candidates remained. We applied a last filter by removing candidates harboring the "edited" pattern in the genomic DNA reads (Fig 1F). The goal was here to take into account putative candidate regions for which the corresponding DNA reads were present in raw sequence data, but unmapped or not mapped to the same position as the "edited" RNA reads. At the end of the analysis, we found 36 reliable RDD candidates (Table 2). A total of 17 chicken genes are potentially impacted by these RDD sites, knowing that one site can be associated with several genes and that we are probably missing non-annotated genes for candidates highlighted in intergenic regions. Interestingly, many of these candidates were organized in clusters, the 36 positions corresponding to 20 different genomic regions (Table 2). A total of 7 clusters, in 5 annotated genes and in 2 intergenic regions, could be counted, encompassing 12 to 1,439 bp. The distance between 2 clustered RDDs ranged from 3 to 807 bp, for a number of clusters comprised of between 2 and 5 detected sites.

RDD types
We distinguished canonical RDD (A-to-G and C-to-T) from non-canonical RDD (other base changes). As the sequencing process was not strand-specific, the complement bases of canonical changes were also considered as canonical (i.e. T-to-C and G-to-A).
When comparing our datasets before and after filtering, we observed a clear enrichment in canonical changes throughout successive filters, which was reassuring in terms of results accuracy (Fig 2). Before filtering, all possible base changes were represented, at a frequency ranging from 5% to 20% (Fig 2). Altogether, canonical base changes represented 50% of RDD candidates. After filtering, canonical base changes represented all modifications except one, at position chr6: 29787642. This non-canonical A-to-C position seemed to be the result of a    misalignment involving an alternative splice-site. This position was selected for pyrosequencing validation. Among the canonical modifications, we found only A-to-G or its complement T-to-C modifications, and no C-to-T conversion.
We then characterized the RDD candidates with regards to their putative functional features.

Functional RDD and tissue expression
Three RDD sites, located on CYFIP2, GRIA2 and COG3, were potentially functional, because these changes are non-synonymous, and thus potentially have deleterious effects on the encoded protein. Most of the remaining candidates are located in gene introns, upstream or downstream regions of genes (Table 2, Fig 3). By using 5 different in silico predictors of the amino-acids substitutions' putative effects, we showed that none of the 3 non-synonymous substitutions was likely to be deleterious (Table 3). These substitutions were localized in highly conserved regions of the proteins (S1 Fig). A striking observation is that the K/E editing site affecting the CYFIP2 gene changes an amino-acid conserved between all examined vertebrate species into an amino-acid which is coded without editing by the genomic sequence of Ray-finned fishes.

Characterization of candidates
We designed primers for 14 RDD candidates corresponding to 9 genomic regions, comprising missense variants, intron, upstream or downstream regions, intergenic positions, and the remaining non-canonical modification. We first confirmed the homozygous status of the 14 selected RDD sites in DNA by Sanger sequencing. Their RDD status was then tested by pyrosequencing, and 13 RDD candidates were confirmed as edited loci (Fig 4). It is interesting to note that the unique site not validated by pyrosequencing corresponds to the non-canonical RDD candidate.
A subset of 7 validated candidates was then tested in the other available tissues: individual heart, brain and liver tissues from three developmental times, comprising the same stage as the original HiSeq samples (day 4.5), an older embryonic stage (day 15), and an adult stage (11 months of age). Among these candidates, three are clustered on chromosome 13 (Fig 5B.abc) and two are clustered on chromosome 2 (Fig 5C.ab). These positions were tested for tissue and stage effects on editing levels (Table 4). Tissue effect and stage effect were significant for all candidates (p-value0.05), and an interaction between tissue and stage was also observed for all but one candidate. There was a clear effect of both tissue and stage on the editing level. Interestingly, for 5 candidates out of 7, there was a continuous increase in editing level with age, from about 50% to more than 80%, independently of the tested tissue (Fig 5). In both clustered regions ( Fig 5B and 5C), all candidates harbored the same profile and only differed by their editing level. For one candidate, chr1: 167109833 (COG3 gene , Fig 5Aa), the editing level increased during embryonic development and was less important in adult stage. On chr13: 10717577 (CYFIP2 gene, Fig 5A.b), editing was mainly present in brain, with an increase of its level over time, and was at low level in other tissues. Interestingly, the editing level was tissue-specific at every developmental time point, and increasing for most of the candidates from liver to brain (Fig 5).

Discussion
Among animals, RNA editing is well described in mammals at a whole genome scale, but similar studies were lacking in other vertebrates like chicken. The goals of this study were to screen the entire chicken transcriptome for editing sites, to characterize this phenomenon at different stages of development and tissues, and to extend the knowledge of its conservation among vertebrates.  To do so, we used DNA-Seq and RNA-Seq technologies, allowing us to screen the whole chicken genome for such events. This approach was used recently in several species to detect RDD [12-15, 26, 27, 42-44]. While a large number of new RDD sites was first described using this approach, in particular in humans with more than 10,000 sites observed [43], these results were then contested [28,29], suggesting that RNA editing is a limited process when taking into account possible high-throughput sequencing technologies biases. Later studies confirmed this questioning by finding many fewer RDD sites when stringently filtering the dataset [15,29]. For example, Pickrell and colleagues demonstrated that up to 94% of the 10,210 edited sites highlighted by Li and collaborators [43] were likely to be false positives.
We carefully looked at common analysis pitfalls when detecting RDD sites in our dataset. First, we applied a stringent filter by taking into account only RDD sites observed in at least 2 biological replicates. This filter ensures true biological phenomena are kept, and removes candidates due to individual-dependent artefacts (as specific sequencing errors or somatic mutations putatively not seen in DNA). We then filtered our datasets for known sequence analysis biases as described in the Materials and Methods section. At this step of analysis, we found 112 RDD candidates (Fig 1A, 1B and 1C). Another study filtering its datasets for the same biases, with a more stringent filter concerning the number of biological replicates (at least two thirds of replicates detected as edited) [15], found between 128 (in mouse liver) and 447 (in mouse adipose) RDD candidates at the same filtering step, i.e. more candidates than our results while their study was limited to the exome. It could constitute an argument in favor of the scarcity of RNA editing in chicken. We chose to be really stringent by keeping only RDD positions with a total absence of the alternative nucleotide in DNA. The final step, eliminating candidates for which the edited pattern was found in the DNA reads, removed 48 candidates.
At the end of the filtering steps, we kept less than 10% of putative RDD sites detected at the beginning of our study, which is similar to results obtained in recent studies, taking into account biases linked to high-throughtput sequencing [15,17].
Compared with previous studies, a distinguishable feature of our analysis is the search of the "edited" pattern in the DNA reads of candidates highlighted through RNA-Seq. In several cases, while the "edited" RNA reads map to a candidate region, the corresponding DNA reads do not map to the chicken genome: the RNA read can be mapped to a paralogous region due to the splitting of introns, while the original region is either absent from the genome or is carrying too many mismatches between our individuals and the reference sequence. This can be explained by an incomplete genome assembly and / or several regions with assembly errors. Indeed, the chicken genome assembly is still incomplete, especially regarding microchromosomes [45,46]. The false RDD status of many candidates due to DNA polymorphism in paralogous regions has already been highlighted [14]. A similar observation has been made by Piskol et al [30], leading to the conclusion that non-canonical editing sites are likely to be false positive RDDs.
At the end of the filtering steps, the number of RDD candidates was considerably reduced. Our filters were quite stringent, and we may have missed a few real positions. But as we still detected a false positive candidate through experimental validation, this high stringency was surely appropriate.
It appears that in whole chicken embryo, the number of robust editing sites is limited. Nevertheless, some sites known to be edited in chicken were not highlighted in our analysis, either because of reduced editing level, or low level of expression. This could be explained by the use of 4.5 days whole embryos, in comparison with adult tissues in targeted studies. The proportion of canonical RDD changes increased across filters, which is reassuring about the reliability of the pipeline: only one non-canonical change could be observed, shown not to be a true conversion, due to misalignments along an alternative splice site. The status of this false candidate was confirmed by pyrosequencing.
Every C-to-T conversion detected at the beginning of the pipeline was discarded at a later filter. Moreover, confirming previous studies of RNA editing in chicken, we could not find any editing in APOB transcripts [19]. This is in accordance with the absence of APOBEC1 in the chicken genome, as this enzyme seems to be required for C-to-U APOB RNA editing in vertebrates [5].
After the detection of RDD sites in the chicken transcriptome, our aim was to further characterize several interesting edited positions.
All the tested candidates were shared between the tissues studied in our analysis, with one candidate presenting a very low level of editing in the liver, whatever the stage (chr13 :  10717577, Fig 5Ab). But the editing level varies between the analyzed tissues and ages, confirming in our chicken model that RNA editing varies across times and tissues.
Interestingly, RNA editing levels change over time. As generally observed in mammals, with few exceptions [47][48][49][50], the A-to-I editing level increased during development, as it is the case for 6 out of 7 tested candidates. Even if the tissue and stage specificity of RNA editing is clear in candidates tested from cluster regions (Fig 5B and 5C), it is even more pronounced for tested candidates highlighted separately (Fig 5A). These time-and tissue-specific phenomena are not only due to the level of expression of ADARs [49,51] and more work is needed to decipher the spatio-temporal regulation of RNA editing. The low level of editing at embryonic stages in almost all the tested candidates could be explained by a putative importance for adequate embryologic development, as it was hypothesized for the GRIA2 Q/R site in mammals [49], even if it has to be confirmed.
As previously highlighted, our results confirm that editing at a particular position often comes with editing sites nearby, but no clear functional explanation has been proposed yet [49]. The regional sequence composition and RNA molecule tertiary structure seem to be involved in these clustered editing sites [52,53].
Another interesting result is that only a few candidates were directly affecting the protein sequence by changing an amino-acid. It has been shown that RNA editing can impact protein function, like modifying ions channels in some tissues [54,55], or impacting the ligand-binding affinity [56]. Nevertheless, our results show that RNA editing in chicken is more frequently silent, as already observed [57]. More studies should be performed to confirm these results. But a significant number of candidates are located in non-coding parts of the chicken genome, at least given the current state of the annotation. As in a previous study in human Alu regions [52], we observed a high number of edited sites in introns. Even if our data come from polyA + RNAs, these sites may correspond to editing in pre-mRNAs. But they may be part of noncoding RNAs too, where editing has been discovered in several species, and the biological significance of which is still largely unknown [57,58].
We highlighted 3 candidates that were previously described as edited in mammals, one K/E substitution already observed in the CYPIF2 gene [21], one I/V conversion located in the COG3 gene [16,17,59]-previously described in human, mouse and rat-and the R/G site in the GRIA2 gene [60], which means that these editing events are not restricted to mammals and appeared before the Sauropsid-Synapsid divergence. Possible implications of an altered editing efficiency at the R/G site in GRIA2 in mental disorders in human and mouse were recently observed [61]. Concerning the COG3 editing site, no functional implication is documented at this time, but as underlined in another study [16], the conservation of this site both in mammals and in a broader way in vertebrates implies a putative functional role. Similarly, the functional significance of the conserved CYFIP2 K/E editing, which is higher in brain than in other tissues in human, is not known, but may be implicated in apoptosis [48].
The editing sites are located in regions highly conserved between vertebrates (S1 Fig). Interestingly, the modification observed in the CYFIP2 gene results in a conversion from a glutamic acid to a lysine. This amino-acid is only present in the Ray-finned fishes, and is shared by all of them (http://www.ensembl.org/index.html). The other species for which the homologous CYFIP2 sequence is available are all K-coding at this position, which asks the question of the functionality of this E residue, only present in fishes as a chromosomal codon, but resulting from an editing phenomenon in several vertebrate species, including chicken.
The number of editing sites conserved among vertebrates seems to be very small, and this may be the signature of their functional importance. In this respect an extensive study recently identified only 59 evolutionarily selected sites among mammals [62].

Conclusions
This study constitutes, to our knowledge, the first whole genome screening of RNA editing in chicken. By using a stringent pipeline, we focused on reliable RNA editing events and thus removed most putative false positives, a big pitfall in RNA editing discovery by high-throughput sequencing. Our pipeline predicts reliable RNA editing sites, avoiding biases encountered when using NGS data, and most of the tested sites are confirmed through an independent validation method. This whole genome analysis shows that the A-to-I editing mechanism may be the only one present in chicken. Furthermore, we show strong tissue and stage effects on editing level with a tendency to increase with age. Several edited loci are conserved between chicken and other vertebrate species, including human, which indicates that, while RNA editing arose long ago in the evolution, some particular nucleotides from a few genes are subject to RNA editing. This conservation is probably linked to the molecular mechanisms involved, but more deeply questions the functionality of editing at these specific loci. Even if the extent to which RNA is edited is more and more fully characterized, a huge effort to discover the putative functionality of this phenomenon is still needed.