AA and SM conceived and designed the experiments. AA and SM performed the experiments. AA and SM analyzed the data. AR contributed reagents/materials. AA, AR, and SM wrote the paper.
The authors have declared that no conflicts of interest exist.
RNA editing by adenosine deamination generates RNA and protein diversity through the posttranscriptional modification of single nucleotides in RNA sequences. Few mammalian A-to-I edited genes have been identified despite evidence that many more should exist. Here we identify intramolecular pairs of Alu elements as a major target for editing in the human transcriptome. An experimental demonstration in 43 genes was extended by a broader computational analysis of more than 100,000 human mRNAs. We find that 1,445 human mRNAs (1.4%) are subject to RNA editing at more than 14,500 sites, and our data further suggest that the vast majority of pre-mRNAs (greater than 85%) are targeted in introns by the editing machinery. The editing levels of Alu-containing mRNAs correlate with distance and homology between inverted repeats and vary in different tissues. Alu-mediated RNA duplexes targeted by RNA editing are formed intramolecularly, whereas editing due to intermolecular base-pairing appears to be negligible. We present evidence that these editing events can lead to the posttranscriptional creation or elimination of splice signals affecting alternatively spliced Alu-derived exons. The analysis suggests that modification of repetitive elements is a predominant activity for RNA editing with significant implications for cellular gene expression.
A computational analysis of human RNA has identified 1,445 transcripts that are edited mainly within non-coding Alu repeats, with the potential effect of regulating alternative splicing.
On the molecular level, the complexity of higher organisms is based on the number of different gene products available for structural, enzymatic, and regulatory functions. Posttranscriptional and/or posttranslational mechanisms have an important role in generating RNA and protein diversity (
Currently it is not known if the recoding of mRNAs at single codon positions is the main function of A-to-I RNA editing or if other types of editing events with as yet unknown roles in the regulation of gene expression are more widespread. The recently reported embryonic lethality in mice with ADAR1 deficiency indicates that additional substrates for this enzyme exist that function during early embryonic development (
A recurring theme of edited sequences is the involvement of an imperfectly dsRNA foldback structure (
Despite recent progress in identifying additional genes that undergo RNA editing (
In this study we identify a minimum of 1,445 edited human mRNAs present in existing databases. Clusters of adenosine-to-guanosine (AtoG) discrepancies in these cDNAs are the result of RNA editing involving intramolecular pairs of inverted Alu repeat sequences, repetitive elements that represent approximately 10% of the human genome and are concentrated in and around genes (
We also characterize functional consequences of the observed editing events and the factors that determine editing levels in Alu repeats and their modification patterns. The prevalence of Alu elements in primate genes, together with our experimental and computational analysis, suggests that the vast majority of primary human gene transcripts (greater than 85% of RNAs with average structure) are subject to A-to-I RNA editing. We show how editing might influence the alternative splicing of exonized Alu elements and discuss the implications of this extensive modification of mRNAs bearing repetitive elements for the regulation of gene expression.
A hallmark of an A-to-I RNA editing event is an AtoG transition when comparing genomic and cDNA sequences of the affected gene since inosine base-pairs with cytosine and therefore is replaced by guanosine during reverse transcription and PCR amplification. However, AtoG discrepancies between genomic and cDNA sequences can also be due to single-nucleotide polymorphisms (SNPs) or errors in databases. Therefore the search for edited sequences on a genome-wide basis is not feasible solely based on this single feature. However, in some cases of editing, not a single, but a cluster of AtoG discrepancies between genomic and cDNA sequences is evident within a stretch of a few hundred nucleotides (
In an initial screen for candidate genes, we used the Human Unidentified Gene-Encoded (HUGE) database of ca. 3,000 human cDNAs derived from the Kazusa cDNA sequencing project (
a Not all cDNA regions with A/G discrepancies were analyzed
b SNPs according to National Center for Biotechnology Information SNP database
c Clone hh15303
Alu elements are short interspersed elements found in all primates, which are approximately 300 nt in length (
To better understand the connection between Alus and the observed AtoG clusters, we experimentally analyzed the cDNAs from all 25 candidate genes for RNA editing in human brain. Total RNA and gDNA were isolated from the same human brain specimen to eliminate false positives from unmapped A/G SNPs. For all 25 genes in vivo RNA editing was detected by single-run sequencing of gene-specific RT-PCR products, and for five of them the editing efficiency was quantitatively evaluated through repeated experiments. Extents of editing ranged from less than 2% to 90% at individual sites (
(A) Schematic representation of LUSTR (GPR107, KIAA1624) gene structure around edited exon 15a. The AluSx repeat element in intron 15 and the exonic, inversely oriented AluJo are predicted to form an intramolecular foldback structure as depicted below (MFold software). TM, exonic regions predicted to encode transmembrane domains; *, editing sites.
(B) Editing analysis of exon 15a (sequence in capital letters) and flanking regions. The two major editing sites predicted to change amino acids (H/R and Q/R) are indicated. Editing levels in brain (filled column) and lung (open column) are shown above each edited nucleotide. The splice acceptor site subject to editing is underlined.
(A) The alternatively spliced exon 22a and surrounding region of the BTKI (KIAA1417) gene with two Alu elements and its computer-predicted foldback structure.
(B) Editing analysis of the AluSx- element with the exonic sequence in capital letters and edited A's in bold. The alternative splice acceptor site is underlined with a dashed line; the additional alternative consensus splice acceptor site, which undergoes editing, is underlined with a solid line.
(C) Gene architecture and Alu foldback structure of KIAA1497. The brain-derived cDNA of KIAA1497, also known as LRRN1; (
KIAA0500 is a cDNA of 6,577 nt in length cloned from human brain (AB007969) with a predicted open reading frame of 213 amino acids. Four AtoG discrepancies were present within the coding region of which two lead to an amino acid change (Q/R and S/G, respectively).
(A) Structure of the KIAA0500 mRNA with location of Alu elements indicated and the predicted RNA secondary structure according to the MFOLD algorithm. Large open box indicates predicted open reading frame. *, editing sites.
(B) Editing analysis of an exonic Alu element in KIAA0500. Editing sites predicted to change amino acids are indicated. Our analysis revealed a significant percentage of editing (%G) at the nucleotide positions 3518 (27% ± 3%), 3522 (20% ± 3%) and 3625 (6% ± 1%) and additional editing sites with less than 5% editing, whereas parallel analysis of human gDNA confirmed the presence of adenosine at these positions. Editing levels in brain (filled column) and lung (open column, where detectable) are shown above each edited nucleotide.
Since a prerequisite for A-to-I RNA editing is the presence of a partially base-paired RNA foldback structure (
Schematic presentation of the gene structures from (A) P53, (B) SIRT2, (C) NFκB, and (D) SPG7. Edited repeat elements are marked by asterisks. RNA folds appear as calculated with MFOLD. The AluJb− in p53 is located in the 3′-UTR (A); all others are intronic. *, editing sites.
Because of the abundance of Alu elements in human pre-mRNAs, most primary transcripts contain one or more pairs of oppositely oriented Alus. If a majority of them is indeed subject to A-to-I RNA editing in vivo, it should be possible to predict RNA-edited genes by identifying inverted pairs of Alu repeats in pre-mRNA transcripts. As a proof of principle, the analysis was extended to four arbitrarily chosen genes (p53, SIRT2, NFκB, and paraplegin [SPG7]) containing pairs of Alu repeats as seen schematically in
Many primary gene transcripts allow several energetically favorable foldback structures to be predicted for a given Alu that involve different combinations of Alu pairs. Do all these alternative Alu-pair foldback structures exist in vivo, and are they therefore subject to RNA editing? To address this question we examined the editing pattern of the G-protein-coupled receptor 81 (GPR81; identified through a computational search as described below). GPR81 contains four Alu elements, one sense and three antisense oriented, in the 3.6-kb pre-mRNA and was selected based on Alu repeat configuration and transcript length. If the alternative foldback structures depicted in
(A) The position and orientation of all four Alu elements in GPR81 pre-mRNA is indicated. Three alternative Alu pairings (I–III) are predicted and experimental editing analysis indicates that all three do form in vivo. ORF, open reading frame; *, editing sites.
(B) Editing analysis of AluSp+ in GPR81. Percentages of editing in human brain are indicated. The exonic sequence appears in capitals. The edited AT dinucleotide that becomes a splice donor site is underlined.
These results suggest that Alu elements in human mRNAs are subject to RNA editing by ADARs because of foldback structures formed between two oppositely oriented Alus present within the same primary transcript.
Exonic Alu repeat elements are predominantly located in the 5′- and 3′-UTRs of mRNAs, and as a result, most cases of Alu editing occur in noncoding regions. However, some editing events predict amino acid changes (
The LUSTR1 cDNA codes for a G-protein-coupled seven-transmembrane receptor (also termed GPR107 or KIAA1624), with three AtoG discrepancies located within an alternatively spliced AluJo-derived exon that leads to the in-frame insertion of 29 amino acids between transmembrane regions V and VI of the protein (see
Analyzing the RNA editing pattern of LUSTR1 pre-mRNA revealed additional intronic editing sites, one of which represents the splice acceptor adenosine (AG to IG) in intron 15 (22% edited in brain; see
A picture similar to LUSTR1 emerges from analysis of the gene for human inhibitor of BTKI (also termed KIAA1417;
The analysis of GPR81 revealed another case of Alu exon alternative splicing and, surprisingly, a new mechanism showing how RNA editing might affect RNA processing. Within the AluSp+ element located in the 3′-UTR of GPR81 transcripts a splice donor site (AT to IT) is generated in 57% of primary transcripts by RNA editing. This is predicted to give rise to alternatively spliced mRNA products represented by GenBank entry AF385431 (see
It is intriguing that we find cases where editing in alternatively spliced Alu exons, or within adjacent splice sites, interferes with or counteracts exon formation of an Alu repeat. It suggests that RNA editing might be more generally involved in the regulation of Alu exonization. Recently, it has been shown that more than 5% of the alternatively spliced exons in the human genome are Alu derived (
Furthermore, RNA editing in Alus might be involved also in the generation of novel introns as seems to be the case in GPR81. Statistically, however, the exonization of intronic Alus would be much more frequent than the intronization of exonic Alus because of the abundance of Alus in introns.
The results presented above show that clusters of AtoG mismatches in cDNA/gDNA sequence comparisons represent an effective way to identify authentic editing cases with a low rate of false positives. Since all clusters of AtoG discrepancies mapped to repeat elements, we wondered how prevalent the editing of Alu or other repeat elements is in the human transcriptome. Therefore, we devised a database search procedure to identify pairs of inverted repetitive elements in human mRNAs exhibiting AtoG transitions.
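The cDNA/gDNA comparison that underlies such a screen can be illustrated with a minimal Python sketch. It assumes the two sequences are already aligned to equal length (the alignment step itself is not shown), and the function name `count_mismatches` and the `XtoY` labels are illustrative choices that mirror the AtoG notation used here, not part of the original analysis software.

```python
from collections import Counter


def count_mismatches(gdna: str, cdna: str) -> Counter:
    """Count mismatch types between two pre-aligned, equal-length sequences.

    Keys follow the 'AtoG' convention: genomic base first, cDNA base second.
    Positions containing anything other than A, C, G, or T are skipped.
    """
    if len(gdna) != len(cdna):
        raise ValueError("sequences must be aligned to equal length")
    counts = Counter()
    for g, c in zip(gdna.upper(), cdna.upper()):
        if g != c and g in "ACGT" and c in "ACGT":
            counts[f"{g}to{c}"] += 1
    return counts


if __name__ == "__main__":
    # Toy example: three genomic adenosines read as guanosine in the cDNA.
    print(count_mismatches("ACGTAAGA", "ACGTGGGG"))
```

A cluster of `AtoG` counts with few or no other mismatch types in the same element is the signature the screen looks for; isolated single mismatches remain ambiguous because they may equally reflect SNPs or sequencing errors.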
Initially, a limited search was carried out for closely spaced (less than 2 kb) inverted pairs of human Alu, MIR, and L1 repeat elements that overlap with exonic sequences and for which an mRNA sequence can be found in GenBank entries. This search, involving about one-third of all repeat elements in the human genome, identified 71 mRNAs with exonic repetitive-element pairs (51 Alu, six L1, six MER, and eight MIR). From those mRNAs, 27 displayed clusters of AtoG changes, all in Alu elements. Fourteen of these genes were chosen for experimental analysis, and all 14 proved to be subject to A-to-I RNA editing (
a SNPs according to National Center for Biotechnology Information SNP database
b Not all cDNA regions with A/G discrepancies were analyzed
ND, not determined
We analyzed a total of 103,723 human mRNA sequences (from the University of California, Santa Cruz [UCSC] Genome database [
(A) Plot of the nature and number of mismatches within Alu and L1 sequences present in human cDNAs. For reasons of comparison the L1 mismatch numbers have been multiplied by 2.9 so that the non-AtoG mismatch count for Alu and L1 is identical. Transition mismatches AtoG, GtoA, CtoT, and TtoC are displayed together for comparison.
(B) Plotted are the total number of Alu sequences found in human cDNAs (first column) and the number of elements harboring AtoG and GtoA mismatches (second and last column). The third column indicates the high confidence set of edited elements (α = 0.000001).
While the finding that non-AtoG transitions (GtoA, CtoT, and TtoC) are approximately three times more frequent than transversions is in line with results from previous studies analyzing gDNA sequences (
We then devised a statistical approach to distinguish repetitive elements that show AtoG mismatches due to sequencing errors and SNPs from those that have undergone A-to-I RNA editing. The method was based on the observation above that Alu elements subject to RNA editing undergo multiple base modifications that result in a cluster of AtoG discrepancies (5–30) between cDNA and gDNA. The probability that a cluster of several AtoG discrepancies is due to sequencing errors or SNPs (in the absence of an increased number of other nucleotide discrepancies indicating low-quality sequence data) is negligible. Thus the number of clustered AtoG changes can be used to distinguish genuinely edited elements from elements with aberrant or non-editing-related base changes. For each Alu element with AtoG discrepancies, we computed the χ2 test comparing the observed number of AtoG discrepancies with the expected number, based on the number of non-AtoG mismatches present in the same sequence. Elements with a χ2 higher than the critical value for α = 0.00001 (corresponding to a probability of one in 100,000 that the observed AtoG transitions are due to SNPs or sequencing errors) were selected as “edited” and will be called so throughout the rest of the manuscript. Using this approach we found that out of those 17,406 mRNAs with one or more exonic Alu elements, 1,445 (8.0%) mRNAs are edited within one or more of the Alu sequences (for a full list of edited mRNAs see
In order to validate our screening approach, we performed an identical analysis for GtoA, CtoT, and TtoC mismatches. Compared to the 1,925 AtoG-edited Alu elements in mRNA, we found 12 GtoA, 11 CtoT, and 11 TtoC cases of “editing.” These cases may represent false positives and thus set the error level of our screen to less than 0.6%.
These results suggest that out of the 103,723 human mRNAs at least 1.4% are A-to-I edited within an exonic Alu element. Apart from Alu repeats, many more low- and high-frequency repeats exist in the human genome (
Most Alu repeats are located in introns, and it is there where the bulk of RNA editing is expected to occur. The average number of Alu repeats per gene is 12.4 estimated for Chromosomes 21 and 22 (
To gain insight into the factors that determine which Alus are subject to RNA editing, and under what circumstances, the identified set of 1,925 high-confidence cases of editing in Alu elements (contained in the 1,445 mRNAs listed in
It was assumed that the observed editing is the result of RNA foldback structures formed between intramolecular inverted Alu repeats, as we have demonstrated for all the experimentally analyzed cases. If this hypothesis is correct, then the distance between an Alu and its closest inverted pairing element should be a critical determinant for how likely it is that a given element will be targeted by the RNA editing machinery. To test this hypothesis the closest inverted Alu was identified within the same gene for all 31,666 Alu elements. A properly oriented element was found for 19,231 of those Alu elements, and a plot was made showing the percentage of edited Alus as a function of the distance between elements (
(A) Percentage of edited elements classified in bins according to the distance separating the element and its closest inverted Alu partner.
(B) Percentage of edited Alu elements clustered according to divergence from their corresponding Alu-subfamily consensus.
(C) Percentage of edited Alu elements in each Alu subfamily.
In (A), (B), and (C) the numbers at the bottom of the bars show the sample size in each bin. (D) Percentage of edited elements according to the tissue from which the RNA was isolated. Error bars show 95% confidence levels.
The distance dependence of the extent of editing clearly suggests that the formation of Alu–Alu stem loop structures predominantly results from intramolecular Alu inverted repeats with an upper limit of approximately 1% editing that could be due to intermolecular Alu pairings. To our knowledge, these results describe for the first time the distance relationship of long-range RNA folding interactions in vivo and how their stability is influenced by distance.
More important, considering the high frequency of Alu elements in primate RNA sequences and the low levels of potential intermolecular editing observed, we conclude that intermolecular duplexes between complementary RNA sequences do not form in the nucleus at a significant rate. This raises the question of how the regulation of thousands of human messages proposed by Yelin and colleagues involving antisense transcripts works (
The editing of RNAs by ADARs has been shown to be dependent on the double-stranded character of the substrates, such that editing levels and promiscuity increase with the extent of the base-paired region (
The pool of mRNAs used in this study represents a heterogeneous collection of sequences from different tissues and cell types. In analyzing the editing of Alu elements as a function of tissue origin (
The pool of edited Alu elements was analyzed for other features that might influence editing levels, such as the position of the edited Alu within the mRNA (3′-UTR, 5′-UTR, and coding region) or its orientation in relation to the mRNA (sense, antisense). No significant correlation of Alu editing was detected with any of these features (data not shown).
The availability of such a large collection of A-to-I edited sequences resulting from this analysis allowed us to examine the modification pattern of edited Alu elements for potential editing hot spots or base preferences. To this end we first aligned all 141 edited Alu sequences (greater than 260 bp) in RNAs originating from Chromosome 1 and mapped the edited sites on their consensus sequence (
(A) The consensus sequence of 141 edited full-length Alu elements present within human Chromosome 1 transcripts with the number of editing events indicated for each sequence position (bars). Insertions and deletions present in fewer than five elements are not shown in the alignment for clarity. Bases conserved in more than 80% of the sequences are boxed. For the lesser conserved consensus positions the next most frequent base is listed below. Consensus CpG dinucleotides are in bold. Arrows indicate “high-efficiency” positions where more than 20% of adenosines present appear to be edited. Note the overlap of these positions with CpGs. Major features of Alu sequences, such as the A-Box and B-Box of Pol III and the Alu polyA sequence are labeled.
(B) A typical Alu foldback structure and its major features as discussed in the text. Arrows indicate TA hot-spot positions. The magnifications show the two typical configurations of editing sites found in Alu pairs: mismatched A/C bulges (i) and A/U base pairs (ii).
As a result of the high CpG mutation rate, frequently the Alu foldback structure of the unedited RNA is predicted to carry A–C mismatches at these positions. Editing at these sites restores the CpG repeat (CA→CI) on the RNA level and converts the A–C mismatch to an I–C base pair. Energy calculations for several predicted Alu pairs show, surprisingly, that the stability of the foldback structure is not diminished by editing but often increased because of the high frequency of I/C pair formation (data not shown). It is therefore unlikely that in the case of Alu foldback structures, RNA editing serves to resolve RNA secondary structures that interfere with the processes of splicing or translation of these RNAs, as suggested previously (
While the above analysis shows the qualitative features of the editing sites in Alus, determination of
Tables (i) and (ii) show the frequency of A, G, C, T, or an A/G editing site at positions −2, −1, 1, and 2 relative to each of the 14,774 AtoG mismatch sites found within the high-confidence group of Alu elements (i) and in relation to a randomly chosen adenosine from each of those sequences for each AtoG mismatch (ii). Table (iii) shows relative editing preferences after bias removal by subtracting Table (ii) from Table (i). (iv) Graphical representation of Table (iii).
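The bias-removal step described in this legend can be sketched in Python as follows. This is a simplified illustration for a single sequence, whereas the analysis above pools all high-confidence elements. It tracks only A, C, G, and T and omits the separate "A/G editing site" category. The names `neighbor_freqs` and `preference` and the fixed random seed are assumptions made for the example.

```python
import random
from collections import Counter

OFFSETS = (-2, -1, 1, 2)


def neighbor_freqs(seq, sites):
    """Fraction of each base at each offset around the given 0-based sites."""
    table = {off: Counter() for off in OFFSETS}
    for pos in sites:
        for off in OFFSETS:
            j = pos + off
            if 0 <= j < len(seq):
                table[off][seq[j]] += 1
    freqs = {}
    for off, counts in table.items():
        total = sum(counts.values())
        freqs[off] = {b: (counts[b] / total if total else 0.0) for b in "ACGT"}
    return freqs


def preference(seq, edited_sites, rng=None):
    """Observed neighbor frequencies minus a random-adenosine background."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    adenosines = [i for i, b in enumerate(seq) if b == "A"]
    # One randomly chosen adenosine per edited site, as in table (ii).
    background = [rng.choice(adenosines) for _ in edited_sites]
    obs = neighbor_freqs(seq, edited_sites)
    bg = neighbor_freqs(seq, background)
    return {off: {b: obs[off][b] - bg[off][b] for b in "ACGT"} for off in OFFSETS}
```

Positive values in the result indicate bases that are over-represented next to edited adenosines relative to adenosines in general; negative values indicate under-representation.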
Considering the available data on in vitro editing activities of ADARs on dsRNA molecules of different sequences and structures, it is not surprising that highly base-paired RNA foldback structures such as the ones induced by Alu inverted repeats are substrates for the editing enzymes. However, it is remarkable, and perhaps surprising, that these predicted structures are edited in vivo at significant levels. This indicates that many of these structures do form in vivo and are readily accessible to ADARs in the nucleus.
Alu elements are ideal for the formation of editable RNA structures because of their large numbers, size, and degree of conservation. We find no evidence for a sequence or otherwise specific interaction of the editing machinery with Alu sequences. Thus, other repetitive elements able to form similar structures should also be targets of A-to-I editing. Our data suggest, however, that editing levels in all other major repeat-element families that dominate the human genome (LINE, LTR, and other short interspersed elements) are very low compared to editing levels seen in Alu repeats (see
For mRNA fractions, we estimated the inosine content due to Alu editing as follows: In 103,723 mRNAs we found 23,204 AtoG mismatches, while the other transition types in the same sequence sample average 3,271 mismatches. Assuming an average mRNA size of 4 kb, the inosine content generated by editing in Alu sequences is estimated to be one inosine every 20,814 nucleotides (103,723 × 4,000/[23,204 − 3,271]). This estimate for Alu editing is within the range of one inosine in 17,000 nt (brain), one in 33,000 nt (lung, heart), to one inosine in 150,000 nt (skeletal muscle) that was experimentally determined by Bass and colleagues in the polyA-fraction of rat RNAs (
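The arithmetic behind this estimate can be reproduced with a few lines of Python. The function name `nt_per_inosine` is an illustrative choice; the inputs are the figures quoted in the text, with the mean of the other transitions used as the background to subtract.

```python
def nt_per_inosine(n_mrnas: int, avg_len_nt: int,
                   atog_mismatches: int, background_mismatches: int) -> float:
    """Nucleotides per inosine, after subtracting background mismatches."""
    excess_atog = atog_mismatches - background_mismatches
    if excess_atog <= 0:
        raise ValueError("no AtoG excess over background")
    return n_mrnas * avg_len_nt / excess_atog


if __name__ == "__main__":
    estimate = nt_per_inosine(
        n_mrnas=103_723,              # mRNA count of the analyzed UCSC set
        avg_len_nt=4_000,             # assumed average mRNA size (4 kb)
        atog_mismatches=23_204,       # AtoG mismatches found
        background_mismatches=3_271,  # mean of the other transition types
    )
    print(f"one inosine every {estimate:,.0f} nt")
```

The result scales linearly with the assumed average mRNA length, so the 4-kb assumption dominates the uncertainty of the estimate.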
While a significant amount of editing occurs in mRNAs that contain repetitive elements in their exons, our results predict that the bulk of A-to-I editing takes place in intronic sequences missing from cDNA databases. This is suggested by the experimental results regarding the LUSTR, GPR81, p53, SIRT2, NFκB, and paraplegin genes, for which intronic data was available (see
It has been shown that hyperedited, inosine-containing RNAs are retained in the nucleus by a protein complex containing the inosine binding protein p54 (
A connection between A-to-I RNA editing and RNAi has recently been suggested through studies in
The work presented here has been based on the analysis of cellular mRNAs that contain Alu repeat elements. However, the underlying principles probably also apply to Alu RNAs generated from transcriptionally active Alu elements. Alu elements do not encode transcription termination signals (
A recent study by
The two approaches are overlapping as well as complementary. Taken together, they have probably uncovered the most significant part of the heavily edited exonic sequences for which sequence data are available. From our analysis we estimate an additional approximately 4,000 edited Alu elements besides the 1,925 Alus that we have selected as a very high confidence set. Thus, it is important to note that the heavily edited sequences represent the tip of an iceberg with many more mRNAs in the human transcriptome being edited at single or a small number of positions.
Human brain samples were provided by the Harvard Brain Tissue Resource Center, Belmont, Massachusetts, United States; human lung cDNA was from Clontech (Palo Alto, California, United States). Total RNA isolation and reverse transcription have been described previously (
For analysis of the pool of human cDNA sequences we developed a program named Procedures for Repetitive Element Foldback Analysis (PREFA). We used the set of cDNA sequences from the UCSC database (July 2003) comprising 103,723 sequences (after removal of duplicate entries). The set of repetitive elements (for Alus 1,163,041 unique elements) and related information of the human genome (created with RepeatMasker based on the Repbase [
The RNA and genomic sequence for each element was extracted and compared base by base for mismatches. A small number of cases with very high non-AtoG mismatches (greater than 20/element) were discarded as misaligned or erroneous. From the repetitive elements showing at least a single AtoG change we selected those where mismatch distribution cannot be accounted for by SNPs and sequence errors using the following procedure:
The overall expected ratio of AtoG discrepancies relative to the total number of mismatches was calculated from the whole sample, assuming the expected AtoG mismatches to be approximately equal to the average of the rest of the transitions:
The expected probability for an AtoG mismatch at a single position in a given element was calculated from the total number of mismatches found in the element in cases where other mismatches were present (2) or from the whole sample where only AtoG mismatches were found (3). Here nAtoG and nOther are the total numbers of AtoG and non-AtoG mismatches found for this element:
Given the probability
A χ2 test was calculated for each element, and those with a χ2 value exceeding the critical value (for α = 0.000001) were selected as edited (these values correspond to more than approximately five AtoG changes in the absence of any other change in the approximately 300 bp of an Alu).
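The selection step can be illustrated with a hedged Python sketch. It is not the PREFA implementation: it assumes a one-degree-of-freedom goodness-of-fit test on two categories (AtoG versus all other mismatches), and it collapses the per-element and whole-sample expectations described above into a single parameter `p_expected`. The default of 0.2 is purely illustrative and is not a value taken from the study. All function names are hypothetical.

```python
from statistics import NormalDist


def chi2_critical_1df(alpha: float) -> float:
    """Critical chi-square value for 1 degree of freedom.

    Uses the identity that a 1-df chi-square variable is a squared
    standard normal, so no external statistics library is needed.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z


def atog_chi2(n_atog: int, n_other: int, p_expected: float) -> float:
    """Chi-square statistic for observed vs expected AtoG mismatch counts."""
    n = n_atog + n_other
    if n == 0:
        return 0.0
    exp_atog = n * p_expected
    exp_other = n - exp_atog
    return ((n_atog - exp_atog) ** 2 / exp_atog
            + (n_other - exp_other) ** 2 / exp_other)


def is_edited(n_atog: int, n_other: int,
              p_expected: float = 0.2,   # illustrative assumption only
              alpha: float = 1e-6) -> bool:
    """Flag an element whose AtoG excess is significant at level alpha."""
    has_excess = n_atog > (n_atog + n_other) * p_expected
    return has_excess and atog_chi2(n_atog, n_other, p_expected) > chi2_critical_1df(alpha)
```

With the illustrative `p_expected` of 0.2, an element carrying six AtoG changes and no other mismatch is flagged while one carrying five is not, which is in line with the rule of thumb stated above. A different expected ratio would shift that threshold.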
For each element in the high-confidence set the closest inverted element was identified among the elements present in the same gene boundaries. The distance separating the pair was calculated from the location of the first base of each element, according to the genomic sequence numbering and irrespective of their orientation. The divergence of each element was derived from the corresponding entry in the UCSC annotation database (ChrN_rmsk) representing mismatches per hundred bases. Tissue of origin of the RNAs was also derived from the UCSC mRNA annotation. For RNAs described to originate from multiple tissues, the corresponding RNAs were included in the count for each of those tissues. RNAs originating from a specific subregion of a tissue, such as subareas of the brain, were counted within the subregion but not in the whole-tissue set of RNAs.
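The partner search described here can be sketched as follows. The `Repeat` record and its field names are hypothetical stand-ins for the repeat annotation; the distance is taken between the first bases of the two elements, as stated above, and only elements within the same gene and in the opposite orientation are considered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Repeat:
    gene: str    # gene whose boundaries contain the element
    start: int   # genomic coordinate of the element's first base
    strand: str  # "+" or "-"


def closest_inverted(elem: Repeat,
                     repeats: Sequence[Repeat]) -> Optional[Tuple[Repeat, int]]:
    """Return (partner, distance) for the nearest oppositely oriented
    element in the same gene, or None if no such element exists."""
    best: Optional[Tuple[Repeat, int]] = None
    for other in repeats:
        # Requiring the opposite strand also excludes the element itself.
        if other.gene != elem.gene or other.strand == elem.strand:
            continue
        distance = abs(other.start - elem.start)
        if best is None or distance < best[1]:
            best = (other, distance)
    return best
```

Binning the returned distances and computing the fraction of edited elements per bin would then give a distance profile of the kind plotted for the edited Alus. The linear scan is adequate for a sketch, although a sorted index per gene would be preferable for more than a million annotated elements.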
Alignment of the Chromosome 1-derived Alu sequences was performed with the MegAlign program of the DNASTAR (Madison, Wisconsin, United States) package (Lasergene) using the CLUSTAL algorithm (
The database lists the GenBank accession numbers, gene names, gene product description, chromosome location, and type of Alu element and location within the mRNA sequence, the identity of the most likely pairing Alu elements within the same gene, and the distance in base pairs (bp) between the pairing Alus. The positions of all predicted editing sites within the individual sequences can be viewed by pasting the accession number into the UCSC genome browser (
(276 KB XLS).
The GenBank ((
The Entrez Gene (
We thank Jessica Rosenkrantz and Kieran Pechter for technical assistance and Chris Burge for useful discussions. AA was supported in part by Human Frontier Science Program.
ADAR, adenosine deaminase acting on RNA
BTK, Bruton's tyrosine kinase
dsRNA, double-stranded RNA
EST, expressed sequence tag
gDNA, genomic DNA
HUGE, Human Unidentified Gene-Encoded
PREFA, Procedures for Repetitive Element Foldback Analysis
RNAi, RNA interference
siRNA, small interfering RNA
SNP, single-nucleotide polymorphism