Characteristics of Transposable Element Exonization within Human and Mouse

Insertion of transposed elements within mammalian genes is thought to be an important contributor to mammalian evolution and speciation. Insertion of transposed elements into introns can lead to their activation as alternatively spliced cassette exons, an event called exonization. Elucidation of the evolutionary constraints that have shaped fixation of transposed elements within human and mouse protein coding genes and subsequent exonization is important for understanding of how the exonization process has affected transcriptome and proteome complexities. Here we show that exonization of transposed elements is biased towards the beginning of the coding sequence in both human and mouse genes. Analysis of single nucleotide polymorphisms (SNPs) revealed that exonization of transposed elements can be population-specific, implying that exonizations may enhance divergence and lead to speciation. SNP density analysis revealed differences between Alu and other transposed elements. Finally, we identified cases of primate-specific Alu elements that depend on RNA editing for their exonization. These results shed light on TE fixation and the exonization process within human and mouse genes.


Introduction
The draft sequences of the human and mouse genomes confirmed that transposed elements (TEs) have played a major role in shaping mammalian genomes [1,2]. Sequences of transposed elements comprise at least 45% of the human and 37% of the mouse genomes (Lander et al., 2001;Waterston et al., 2002). A large fraction of the TEs were inserted into transcribed regions, mostly within intronic sequences [3]. These intronic insertions contributed to the enlargement of intron size within mammalian genomes (Lander et al., 2001;Waterston et al., 2002). Sironi et al. identified constraints on insertion of TEs within introns [4] and showed that gene function and expression influence insertion and fixation of distinct transposon families in mammalian introns [5].
Exonization is the creation of a new exon as a result of mutations in intronic sequences [6], whereas intronization is the creation of a new intron. TEs have enriched the human transcriptome by exonizations [7] and intronizations [3]. In human, most of the exons that originated from TEs are from the primate-specific transposon called Alu. Alu elements are the most abundant repetitive elements in the human genome; there are upwards of 1.1 million copies, accounting for more than 10% of the human genome [1,8]. Alu elements are derived from the 7SL RNA [9]. The major burst of Alu retroposition took place 50-60 million years ago and has since dropped to a frequency of one new retroposition for every 20-125 births [10,11]. Alu-mediated mutagenesis, mostly through nucleotide insertions, has been estimated to be involved in close to 1% of Mendelian genetic disorders [12]. The occurrence of single nucleotide polymorphisms (SNPs) in and around Alu sequences has been discussed [8,13].
Makalowski and coworkers were the first to describe Alu elements within mature mRNA in human [14]. It is now clear that transposed elements are found within a large number of mature mRNAs [15]. The new exons generated from Alu elements are usually alternatively spliced; these exons comprise ~5% of alternatively spliced exons in the human transcriptome [16]. Exonized TEs that are alternatively spliced are not unique to human, as most of the exonized TEs in the mouse genome are also alternatively spliced [3]. The molecular mechanism leading to Alu exonization has been well characterized. A typical Alu is around 300 nt and contains two similar monomer segments joined by an A-rich linker and a poly(A) tail-like region. Alus insert into introns of primate genes by retrotransposition, usually in the antisense orientation. Eighty-five percent of exonizations have occurred from the right arm in the antisense orientation [3,16]. The poly(A) tract of this arm in the antisense orientation creates a strong polypyrimidine tract (PPT). Downstream from this PPT a 3′ splice site is selected and further downstream from that site (approximately 120 nt) a 5′ splice site is recognized [17]. Without the left arm, exonization of the right arm shifts from alternative to constitutive splicing. This results in elimination of the evolutionarily conserved isoform and may thus be selected against [18]. Only one or two mutations are required within intronic Alus that reside in antisense orientation relative to the coding sequences to yield a consensus 3′ splice site [19] or 5′ splice site [20]. The role of splicing regulatory sequences in the exonization process has also been studied [21,22,23]. The 3′ splice sites of exonized Alus are very similar to the 3′ splice sites of mammalian interspersed repeat (MIR) exons [24].
Recent studies indicate that the pattern of splicing of exonized TEs differs among human tissues [25,26,27]. Additionally, there are variations in splicing patterns within individuals in the human population [28,29,30]. Certain SNPs correlate with heritable changes in alternative splicing but do not cause disease, thus indicating a link between genetic variation and mode of splicing [29,31,32]. Another study identified SNPs correlated with obesity that cause variation within alternative splicing patterns [28].
The exonization process is subject to many evolutionary constraints: New exons are generally alternatively spliced [7] and the inclusion rate is relatively low [16,19,33]. This implies that novelties added to established genes (within established coding sequences, CDSs) are under lower purifying selection if they do not interfere with the original coding sequence, compared to those events that change the original CDS. Also, exonization usually occurs in untranslated regions (UTRs) [3] or within duplicated genes [34], further supporting the idea that purifying selections are more intense on exonization events that occur within CDSs. Thus, alternative splicing of Alu exons enriches the human transcriptome with new mRNAs without eliminating the original, functionally important transcripts, which are generated via exon skipping [35].
Here we set out to find additional characteristics of TE exonization events within human and mouse. We looked at the location of the exonizations within genes and the SNP densities, and evaluated SNPs that change canonical splice sites. We found that exonizations occur preferentially in the beginning of protein coding sequences. Moreover, we show that exonizations can be population specific. Our findings reveal a possible contribution of TE exonizations to population divergence within human and mouse.

The locations of TE exonizations within coding sequences
Non-symmetrical, conserved, alternatively spliced exons are more often located at the beginning of the CDS than elsewhere in transcripts [36,37,38]. We analyzed the Transpogene database of exons that originated from TEs [39] to determine whether there is a bias in their location within mRNA. We normalized the CDS length between 0 and 1 (see Materials and Methods) and compared, in increments of 0.1, the extent of TE exonization at different locations in human and mouse (Figure 1). We found that exonized TE sequences are biased to reside in the first half of the CDS compared to alternatively spliced cassette exons that did not originate from TE exonizations. Most exonizations in both human and mouse are found between position 0.1 and position 0.4 within the CDS, with a median location of 0.336 in human and 0.369 in mouse. No statistically significant differences were observed between the human and mouse populations or between different TE families. Alternatively spliced cassette exons that did not originate from TEs are found at a median location of 0.513 and 0.507 in human and mouse, respectively. Statistically significant differences were observed between alternative cassette exons and TE exons (Wilcoxon Rank Sum test, p = 1.2244e-27 and p = 1.2322e-6 for human and mouse, respectively). These results imply that most TE exonizations tend to occur within the first introns of genes. Among human non-TE alternatively spliced exons, 1353 out of 17,642 are the second exon, whereas among TE-derived exons 233 out of 927 are found in the first intron and, if spliced, become the second exon; this difference is statistically significant (Fisher's exact test, p < 10^-42). The first intron is substantially longer than the other introns in most human and mouse genes and shows a higher rate of TE insertion [39]. The longer introns presumably provide a good environment for exonization [40].
Effects of TE exonization within the first intron are usually neutral with respect to the protein sequence, but can affect signal sequences [41].
In order to analyze whether the location bias results from potential involvement of purifying selection, we separated our data into three groups: exonizations that contain an in-frame stop codon (599 exons), exonizations that are non-symmetrical and do not contain an in-frame stop codon (216 exons), and symmetrical exons that do not contain stop codons (137 exons). The median locations within the normalized CDS of these three groups are 0.3062, 0.3795, and 0.4199, respectively. The Wilcoxon Rank Sum test showed that there is a statistically significant difference between the first and the third group (p = 0.0428) but not between the second group and the third group or the first group and the second (p = 0.2555 and p = 0.3641, respectively). This observation strengthens the hypothesis that the 5′ position bias of TE exonization has a connection with the NMD machinery. We previously showed that non-symmetrical exons (not related to TEs) that are alternatively spliced in both human and mouse (and thus likely to be functional events) tend to be located near the 5′ end of the CDS, whereas conserved symmetrical alternative exons are located throughout the CDS [37]. The current results show a statistically significant difference in location between symmetrical exons and those with in-frame stop codons. We hypothesize that TE-driven alternative exons are under purifying selection to be located at the beginning of the CDS, presumably to enhance identification of the TE-containing mRNA by the nonsense-mediated decay (NMD) system [42].
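The three-way grouping described above can be illustrated with a minimal Python sketch. It assumes that "symmetrical" means an exon length divisible by three and that the reading-frame phase of the exon is known. The function names, the `phase` parameter, and the toy sequences are hypothetical and are not taken from the study's pipeline; codons that span the exon boundaries are ignored for simplicity.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_in_frame_stop(exon_seq: str, phase: int = 0) -> bool:
    """Check complete codons inside the exon for a stop codon.

    `phase` (assumed known, 0-2) is the number of leading exon bases that
    finish a codon started in the upstream exon. Codons spanning the exon
    boundaries are not examined in this sketch.
    """
    seq = exon_seq.upper()
    for i in range(phase, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


def classify_exon(exon_seq: str, phase: int = 0) -> str:
    """Assign an exon to one of the three groups used in the text."""
    if has_in_frame_stop(exon_seq, phase):
        return "in_frame_stop"
    # Assumption: "symmetrical" = length is a multiple of three,
    # so inclusion does not shift the downstream reading frame.
    if len(exon_seq) % 3 == 0:
        return "symmetrical"
    return "non_symmetrical"


if __name__ == "__main__":
    for toy in ("ATGTAAGG", "ATGGCC", "ATGGCCA"):
        print(toy, classify_exon(toy))
```

Checking for stop codons before checking symmetry mirrors the order in which the groups are defined in the text, so a symmetrical exon carrying a stop codon falls into the first group.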

SNP density within intronic and exonized TEs
Identifying features shaping the architecture of sequence variations is important for understanding genome evolution and mapping of disease loci. A positive correlation was shown previously between Alu elements and SNP density [13]. Analysis of the positive association between schizophrenia and a cluster of SNPs and haplotypes in the seventh intron of the β2 subunit of the type A γ-aminobutyric acid receptor revealed that the Alu-Y near the 5′ end of exon 8 contains as many as 11 SNPs [43].
Here we set out to evaluate and compare SNP densities in all TE families from human and mouse. We obtained the positions of all exons and introns of all genes annotated in the Golden Path database, together with the positions of intergenic regions, and divided the number of SNPs in each type of region by the total length of that type of region. The dataset contained 39,288 human genes. For the human analysis of the SNPs, we evaluated 382,892 exons with 446,357 SNPs, 347,948 introns with 8,428,718 SNPs, and 8,899 intergenic regions with 10,395,717 SNPs. We also used 31,863 mouse genes. For the mouse analysis we evaluated 301,506 exons with 273,700 SNPs, 270,782 introns with 500,541 SNPs, and 8,602 intergenic regions with 661,474 SNPs.
Multiplying the resulting SNP densities by 100 yielded the SNP frequency per 100 bp. The average SNP density in the human genome is 0.43 in exons, 0.4 in introns, and 0.41 in intergenic regions. The similar densities of SNPs in exons, introns, and intergenic sequences were somewhat unexpected, as one might expect strong evolutionary pressure against substitutions in protein coding regions. This might be caused by a bias of the SNP data from dbSNP itself as EST data is the basis for many SNPs. In the mouse genome, the average frequency of SNPs is 0.31, 0.33, and 0.28 in exons, introns, and intergenic regions, respectively. These SNP densities are consistent with the number of SNPs observed in the baseline windows presented in Figure 2 for human TEs and in Figure 3 for mouse TEs. These results are in agreement with the SNP densities previously obtained from exons, introns, and intergenic regions in human and mouse genomic sequences [13].
As shown in Figure 2, the SNP density in primate-specific Alu elements is 0.53, which is higher than the baseline level. The density in Alu elements is the highest level observed among the different families of TEs. Alu elements are GC-rich with 24 or more CpG dinucleotides per element. These dinucleotides are prone to mutation as a result of deamination of 5-methylcytosine. However, only half of the SNPs in young Alu elements were found at CpG dinucleotides [8,20,44]. Also, analysis of the GC-rich Alu body separately from the AT-rich Alu tail showed that both parts are enriched in SNPs [13]. Therefore, the GC content cannot be the sole determinant of this enrichment. For the L1 elements, the SNP density is similar to the baseline frequency, whereas the frequency is lower than baseline for the other families of TEs. A correlation of the age of the different Alu families with the SNP density shown by Ng et al. [13] suggests that the lower SNP density for L1 and the other TE elements might be related to their earlier integration into the human genome. However, we cannot rule out the option that there is not a simple correlation between the age of the TE and the number of SNPs. The primate-specific Alu element and the rodent-specific B1 element originated from the same 7SL RNA gene and share a high level of sequence identity. Nevertheless, the high SNP density detected in Alu elements was not observed in murine B1 elements (Figure 3).
We then examined the SNP density in exonized TEs (Table 1). The SNP density in exonized TEs from all TE families in the human genome is lower than the overall SNP density of all TEs, but the difference is not significant (Mann-Whitney test, p = 0.382, two-tailed). An exception was observed in the CR1 (LINE-3) elements; exonized CR1 elements have a higher than average SNP density. However, only four CR1 elements were exonized, so the sample size is very small. In mouse, for all transposed element families, the density of SNPs in exonized TEs was significantly higher than the overall density in all TEs (Mann-Whitney test, p = 0.004, two-tailed). In mouse, exonization seems to occur preferentially in areas with higher SNP density.

SNPs in the splice sites of exonized TEs may cause variation in the exonization process
In order to investigate the possibility that exonization of TEs creates transcriptomic diversity within the human population, we searched for SNPs that eliminate or create a canonical splice site in a TE. Specifically, we looked either for changes in the invariant AG dinucleotide at the 3′ splice site or the canonical GT or GC at the 5′ splice site. Although there are other positions that might alter recognition by the splicing machinery, only these four positions must be fully conserved to ensure selection by the spliceosome. To enhance the fraction of bona fide exonization events we searched for exonized TEs that are supported by at least two ESTs. Our analysis revealed 10 SNPs in canonical splice sites of TE-derived exons in the human genome (Table 2); these SNPs change a canonical splice site into a non-canonical one (the ancestral nucleotides are also shown in Table 2). Of the ten, five are in the acceptor and five in the donor splice sites. Seven of the SNPs occur in splice sites of exonized Alu elements, two in splice sites of exonized L2 elements, and one in the splice site of an exonized LTR element. To ensure that we identified the sequence without the SNP correctly, we examined the sequences of the orthologous TEs in chimp (Table 2). Additional support for the role of SNPs in TE population-specific exonization is given by the ssSNPTarget database (http://sssnptarget.org/) [45]: the SNPs rs2377301 and rs5758111 have EST evidence for exon skipping due to the SNP modification. In the mouse genome, three splice sites of exonized TEs contain SNPs (Table 3). SNPs were found in the splice sites of an exonized B1 element, an exonized B2 element, and an exonized LTR element; all are within 5′ splice sites. We searched the NCBI Database of Single Nucleotide Polymorphisms for population frequency data. Data were only available for two of the 10 SNPs observed in the human genome (Table 4).
One of them, SNP rs1721244, is located at chr2 position 73983403 and is the first nucleotide of the 5′ splice site. The allele with G has a canonical splice site (GT) but the other allele has a non-canonical splice site (AT). Both splice sites occur at a frequency of more than 0.3 (Table 4); thus, this SNP, and associated splice variation, is common in the human population. In this analysis, we selected only cases in which SNPs clearly changed the sequence directly at the splice site. We did not take into account SNPs within other splice signals or within exonic or intronic splicing enhancers/silencers that might modulate the selection level of the exon. Thus, the effect of SNPs on splicing might be greater than observed here.
We have also built a dataset of TEs with non-canonical splice sites that appear to be active based on evidence of exonization from ESTs or cDNAs. We searched the SNP database for SNPs that might change the non-canonical splice sites into canonical ones. In the human genome, we found 45 SNPs that changed a non-canonical splice site into a canonical site (a GT/GC dinucleotide in the 5′ splice site and an AG dinucleotide in the 3′ splice site; see supplementary data Table S1). Only three such SNPs were identified in the mouse genome. As a result of these SNPs, these exons are flanked by canonical acceptor and donor splice sites, explaining their identification by the splicing machinery and their presence in the EST database.
Population frequency data were available for 11 of the 45 SNPs (see supplementary data Table S2). One interesting case is SNP rs231518 in an L1 element. There are six ESTs and cDNAs with the 5′ splice site sequence AT, but the SNP rs231518 has a canonical 5′ splice site GT. The two alleles have an intriguing evolutionary history. There is a G at the 5′ splice site in chimp and orangutan and an A in rhesus. The sequences of chimp, orangutan, and rhesus were extracted from published sequences and the multi-species alignment of the SNP location was downloaded from UCSC genome browser [46]. We cannot exclude the possibility that A/G polymorphisms also exist within chimp, orangutan, and rhesus based on available data. The SNP rs231518 with the canonical dinucleotide 5′ splice site GT is the most frequent allele in all human populations (G allele frequency of 0.792 in the CEU population, 1 in the HCB and JPT populations and 0.937 in the YRI population, see supplementary data Table S2).

TE exons that depend on editing for their exonization
How new exons are created and established is an intriguing issue. Recently, Lev-Maor et al. [47] demonstrated that exonization of an Alu exon in the NARF gene depends on an RNA editing mechanism. In this case, editing from AA to AI activated the 3′ splice site; inosine is recognized as G by the splicing machinery [48]. We searched for additional cases in which the 3′ splice site of the exonized Alus is AA or the 5′ splice site is AT, such that RNA editing to AG or GT, respectively, would produce a canonical splice site. We did not find any evidence for editing in 5′ splice sites of Alu-derived exons. However, we found six cases of Alu exonization in which the 3′ splice site contains an AA at the genomic level and EST sequences support exonization (Table 5). Two of these cases were found in ESTs generated from brain tissues and another two were from immune system tissues, tissues that have high levels of RNA editing [49,50,51,52]. Two other cases were found in cancerous tissues and in kidney. The most convincing evidence of exonization of an Alu element resulting from RNA editing is found within a non-coding brain-specific gene NR_024561. This exonization is supported by a validated Refseq sequence and three additional cDNA and ESTs (all from brain tissues). Moreover, transcripts containing this exon have three additional A-to-I editing sites within the Alu-derived exon. Several potential editing sites are usually observed within a region that contains two Alu elements located in opposite orientation due to the formation of a long double-stranded RNA structure between the elements [52]. Interestingly, the nearest Alu to that exonized in the NR_024561 gene is in the downstream intron (Figure 4). There is an Alu within the upstream intron but it is more than 2000 nucleotides away and is therefore unlikely to hybridize with the Alu exon [49,50,51,52]. NR_024561 appears to be a non-coding gene and is expressed exclusively in the brain.
A BLAST search against the database of known non-coding RNAs NONCODE [53,54] revealed 85% identity (E-value = 4e-52) of the NR_024562 isoform to the MESTIT1 non-coding RNA [55]. This isoform also had 86% identity (E-value = 6e-48) to the brain-specific non-coding KLHL1 antisense RNA [56]; this RNA is involved in the spinocerebellar ataxia type 8 (SCA8) neurodegenerative disorder [57,58].

TE-derived exons are most often located near the 5′ end of the CDS
Cassette exons that are non-symmetrical and conserved in both human and mouse are more often located in the 5′ region of the coding sequence than in other regions [37]. Inclusion of non-symmetrical exons is likely to cause a frame shift in the coding sequence, introducing a premature stop codon and activating nonsense-mediated decay or producing an unstable protein [36,38,59]. Most TE-derived exons are non-symmetrical [3,16] and are usually exonized from the first introns of a coding gene. We previously suggested that the majority of the TE-derived exons are non-symmetrical because they are still young in evolutionary terms and thus have not yet undergone purifying selection, which eliminates deleterious exonizations. Given a sufficient period of time, some of the currently non-symmetrical exons that are only mildly deleterious will eventually become symmetrical (through small deletions/insertions) and thus will add coding capacity into already established genes. Examples of functional TE-exonizations are exon 8 of the ADAR2 gene [60] and exon 8 of the NARF gene [47]. Nonsense codons in the 3′ halves of genes may activate the RNA degradation machinery less efficiently than those found near the start of a transcript [42,61]; it may also be that longer peptides are more likely to be deleterious than shorter ones [36,37,38]. The first intron is usually longer than the others and thus, following exonization, the two flanking introns are still relatively long. Alternatively spliced exons are generally flanked by longer introns than are constitutively spliced exons [62]. It is also possible that the bias observed may be due to the fact that TEs are more often found near the start of genes than in other regions. These results suggest that the first intron, with its longer size, functions as a "buffer zone" for the emergence of new, potentially deleterious exons.

SNP densities vary depending on TE families
Alu elements were inserted into the human genome after the insertion of other families, such as MIRs, DNA transposed elements, and LTRs [64]. Alu elements show a higher level of exonization than all other TE families [3]. Here we show that Alu elements tend to accumulate more SNPs than other TE families. The higher mutation rate in Alu elements is not correlated with their CpG enrichment [13,63]. There appears to be a correlation between the age of TE transposition and the mutation rate. A small fraction of L1 elements are still active in the human genome [64] and on average L1 elements contain a higher density of mutations than other analyzed families (L2, MIR, DNA, LTR). The average SNP density in TEs in the mouse genome is lower than the SNP density in the surrounding sequences. The SNP density in TEs in the human genome is at least 2-fold higher than that in mouse TEs. Artificial selection and inbreeding accompanying the generation of laboratory mouse strains presumably serve to reduce genomic differences between individual mice. Therefore SNP data from mouse probably do not reflect real population dynamics.

Alu exonization is coupled to the RNA editing mechanism
In our analysis, we found evidence for exonization of an Alu element that probably requires RNA editing. The NR_024561 gene is expressed exclusively in the brain. The exonized Alu element is from the AluJo subfamily and it was inserted into this gene about 25 million years ago [65]. The 5′ splice site dinucleotide GT is conserved in rhesus and gorilla but not in orangutan. The 3′ splice site dinucleotide AA and the editing sites E1 and E3 are conserved in rhesus, orangutan, and gorilla (Figure 4C). The editing site E2 is not conserved in rhesus but is found in orangutan and gorilla. The conservation of these editing sites implies a possible function for this Alu exonization in this non-coding, brain-specific gene.
In summary, exonization of regions of transposed elements is thought to be an important contributor to mammalian evolution and speciation. We found that exonization of transposed elements is biased towards the beginning of the coding sequence in both human and mouse genes. Analysis of SNPs revealed population-specific exonization events, implying that exonizations may enhance divergence. These results shed light on TE fixation and the exonization process within human and mouse genes.

Dataset of TE exonizations within human and mouse protein coding genes
The dataset of human and mouse transposed element exonization was obtained from the TranspoGene database [39]. Sequences of TE exonizations within human and mouse protein coding genes were selected based on UCSC genome browser annotations [66] of the human genome version hg17 and mouse genome version mm6.

Normalization of exon location
Exon location was determined by using the knownGene table downloaded from the UCSC genome browser. In this table, all genes are listed along with their CDS start and end coordinates. To normalize the exon location within the CDS, we calculated the number of possible locations for the start point of the exon in the CDS without exceeding the boundaries of the CDS (N = CDS length − exon length + 1). The normalized location was the quotient of the actual location of the exon start point within the CDS divided by N.
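The normalization above can be written out as a short Python sketch. The text does not say whether the exon start is counted from 0 or from 1; this sketch assumes a 0-based offset from the CDS start, which keeps the result in the interval [0, 1). The function names and the decile binning helper (for the 0.1 increments used in the location analysis) are illustrative assumptions, not the original code.

```python
def normalized_exon_location(exon_start: int, exon_length: int,
                             cds_length: int) -> float:
    """Scale the exon start within the CDS by N = CDS length - exon length + 1.

    `exon_start` is assumed to be a 0-based offset from the CDS start.
    """
    if exon_length < 1 or exon_length > cds_length:
        raise ValueError("exon must be non-empty and fit inside the CDS")
    n_positions = cds_length - exon_length + 1  # N in the text
    if not 0 <= exon_start < n_positions:
        raise ValueError("exon would exceed the CDS boundaries")
    return exon_start / n_positions


def decile_bin(location: float) -> int:
    """Map a normalized location to one of ten bins of width 0.1 (0..9)."""
    return min(int(location * 10), 9)


if __name__ == "__main__":
    loc = normalized_exon_location(exon_start=450, exon_length=100,
                                   cds_length=1000)
    print(round(loc, 3), decile_bin(loc))
```

Dividing by N rather than by the CDS length means that an exon placed as far downstream as the CDS allows still maps close to 1, regardless of exon length.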

Cassette exon dataset
In order to create a dataset of cassette exons that had not originated from TE exonization, we downloaded the altSplice table from the UCSC genome browser [46,67]. We analyzed only the cassette exons dataset. We used GALAXY [68] and RepeatMasker in order to extract the sequences and exclude cassette exons that originated from TEs.

For every family of TEs the average SNP density in the TE-body was determined. For comparison purposes, the SNP density in sequences surrounding the TEs was extracted in 50-bp non-overlapping windows from either end of the TE up to a distance of 3 kb. This yielded 120 windows which we call baselines. The positions of all TEs in the genome and locations of SNPs within each TE were determined using the SNP data set from UCSC Genome Browser Database. The same was done for the surrounding 50-bp non-overlapping windows (up to a distance of 3 kb) for determination of the baseline density of SNPs. The SNP densities were averaged over all TEs and normalized to SNP frequency per 100 bp by dividing the average number of SNPs within the TE by the average length of the TEs divided by 100. Averaging the SNP frequencies in all 50-bp windows flanking the TE yielded the baseline SNP frequency, similar to the calculation described in [13]. The number of SNPs in each of the 50-bp windows was multiplied by 2 to obtain the frequency per 100 bp. The SNP density in exonized TEs was then determined. Exons originating from exonizations of TEs that were flanked by canonical splice sites and that had at least two ESTs confirming their exonization were used. The average SNP density in the exonized TEs was determined for human and mouse. All SNP densities are given as SNPs per 100 bp.
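The density calculations described above reduce to a few lines of arithmetic, sketched here in Python. The input layout (a list of `(length, SNP count)` pairs per TE, and a list of per-window SNP counts per TE) and the function names are assumptions made for illustration; the numbers in the example are toy values, not data from the study.

```python
from statistics import mean


def te_body_density(tes):
    """SNPs per 100 bp inside TEs: mean SNP count / (mean TE length / 100).

    `tes` is an iterable of (length_bp, snp_count) pairs, one per TE.
    """
    tes = list(tes)
    mean_snps = mean(count for _, count in tes)
    mean_length = mean(length for length, _ in tes)
    return mean_snps / (mean_length / 100)


def baseline_density(flank_counts, window_bp=50):
    """Baseline SNPs per 100 bp from flanking windows.

    `flank_counts` holds, for each TE, the SNP counts of its flanking
    windows. Each count is scaled to 100 bp (x2 for 50-bp windows) and
    the scaled values are averaged over all windows of all TEs.
    """
    scale = 100 / window_bp
    return mean(count * scale for windows in flank_counts for count in windows)


def windows_per_te(flank_bp=3000, window_bp=50):
    """Number of non-overlapping windows on both sides of a TE."""
    return 2 * (flank_bp // window_bp)


if __name__ == "__main__":
    print(te_body_density([(300, 2), (100, 0)]))  # toy TEs
    print(baseline_density([[0, 1], [1, 0]]))     # toy flanking windows
    print(windows_per_te())
```

With 3 kb on each side and 50-bp windows, `windows_per_te()` gives the 120 baseline windows mentioned in the text.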

SNPs in the splice sites of the exonized TEs
Annotations of SNPs were obtained from the UCSC Genome Browser Database [66] (versions hg17, May 2004 for human and mm6, March 2006 for mouse). A search for SNPs in splice site dinucleotides of exonized TEs was conducted. Any changes from GT or GC dinucleotides in the first two positions of the intron (5′SS) and AG dinucleotides in the last two positions of the intron (3′SS) by SNPs were considered; these mutations change a canonical splice site into a non-canonical one, thus eliminating the selection of this exon by the splicing machinery. We also considered situations in which SNPs changed a non-canonical splice site into a canonical one if at least one transcript confirmed the existence of the exon.
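The splice-site rule above (GT or GC at the 5′ splice site, AG at the 3′ splice site) can be expressed as a small Python sketch that labels the effect of a SNP on a splice-site dinucleotide. The function name, the `"5ss"`/`"3ss"` keys, and the return labels are hypothetical, and sequences are assumed to be given on the transcribed strand.

```python
# Canonical intron-terminal dinucleotides as defined in the text.
CANONICAL = {"5ss": {"GT", "GC"}, "3ss": {"AG"}}


def snp_effect(dinucleotide: str, position: int, alt_base: str,
               site: str) -> str:
    """Label how a SNP changes a splice-site dinucleotide.

    `position` is 0 or 1 within the dinucleotide; `site` is "5ss" or "3ss".
    Returns "eliminates", "creates", or "no_change" (canonical status kept).
    """
    ref = dinucleotide.upper()
    if len(ref) != 2 or position not in (0, 1):
        raise ValueError("expected a dinucleotide and a position of 0 or 1")
    alt = ref[:position] + alt_base.upper() + ref[position + 1:]
    was_canonical = ref in CANONICAL[site]
    is_canonical = alt in CANONICAL[site]
    if was_canonical and not is_canonical:
        return "eliminates"
    if not was_canonical and is_canonical:
        return "creates"
    return "no_change"


if __name__ == "__main__":
    # G/A polymorphism at the first base of a 5' splice site (GT vs AT),
    # the pattern reported for rs1721244.
    print(snp_effect("GT", 0, "A", "5ss"))
    print(snp_effect("AT", 0, "G", "5ss"))
```

A GT-to-GC change is reported as "no_change" because both dinucleotides count as canonical 5′ splice sites under the rule used here.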
Population frequency data was obtained from the NCBI Database of Single Nucleotide Polymorphisms (dbSNP Build ID: 125) [73]. This data was only available for a small number of SNPs in dbSNP. Many researchers do not provide genotype or frequency data in their submissions. dbSNP Build ID 125 had approximately 27 million SNPs and only 3.5 million of these had frequency data associated with them.
Dataset of Alu exonization resulting from editing of the 3′ splice site
The dataset of Alu exonizations was searched for Alu elements with the non-canonical AA 3′ splice sites or the AT non-canonical 5′ splice site. These Alus were filtered according to the following criteria: (1) no SNPs were detected within these splice sites, (2) at least one A to G transition was detected between the DNA sequence and the mRNA, and (3) another Alu sequence in reverse orientation is located within a distance of 2000 bp.

Figure 4 (caption, beginning missing). … (AluS) is 731 base-pairs downstream of the exonized Alu. Sense and antisense Alus are expected to form double-stranded RNA, thus allowing RNA editing. RNA editing changes an AA dinucleotide into a functional AG 3′ splice site (lower panel). RNA editing also occurs in three positions in the Alu-derived exon (E1, E2, and E3). (B) Predicted folding of the sense and antisense Alu sequences (upper and lower lines, respectively). Adenosines that undergo editing are marked in red. Splice sites utilized for Alu exonization are marked as 5′ss and 3′ss on the alignment. (C) Alignment of this region from four species: human, gorilla, orangutan, and rhesus. The 5′ splice site, 3′ splice site, and the three editing positions are marked in yellow. doi:10.1371/journal.pone.0010907.g004
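The three filtering criteria for editing-dependent Alu exonization given in this section can be combined into a single predicate, sketched below in Python. The record layout, field names, and the choice to treat "within a distance of 2000 bp" as less than or equal to 2000 are assumptions for illustration; the example values are toy inputs loosely modeled on the 731-bp spacing mentioned in the Figure 4 caption.

```python
from dataclasses import dataclass
from typing import Optional

MAX_INVERTED_ALU_DISTANCE = 2000  # bp, from the filtering criteria


@dataclass
class AluExon:
    """Hypothetical record for one exonized Alu element."""
    site_3ss: str                          # genomic 3' splice-site dinucleotide
    site_5ss: str                          # genomic 5' splice-site dinucleotide
    snp_in_splice_site: bool               # any known SNP at the splice sites
    a_to_g_mismatches: int                 # genomic A read as G in the mRNA
    inverted_alu_distance: Optional[int]   # bp to nearest inverted Alu, if any


def may_depend_on_editing(exon: AluExon) -> bool:
    """Apply the search condition and the three filters from the text."""
    # Search condition: AA at the 3' site or AT at the 5' site, so that
    # A-to-I editing (read as G) would yield AG or GT.
    editable_site = (exon.site_3ss.upper() == "AA"
                     or exon.site_5ss.upper() == "AT")
    has_inverted_partner = (
        exon.inverted_alu_distance is not None
        and exon.inverted_alu_distance <= MAX_INVERTED_ALU_DISTANCE
    )
    return (editable_site
            and not exon.snp_in_splice_site       # criterion 1
            and exon.a_to_g_mismatches >= 1       # criterion 2
            and has_inverted_partner)             # criterion 3


if __name__ == "__main__":
    toy = AluExon("AA", "GT", False, 4, 731)
    print(may_depend_on_editing(toy))
```

Requiring the absence of SNPs at the splice site keeps apparent A-to-G differences between genome and transcript from being explained by a polymorphism rather than by editing.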

Supporting Information
Table S1 SNPs in non-canonical splice sites of exonized transposed elements in the human genome as well as in the mouse genome resulting in a canonical splice site. Given are the gene id, the chromosome and strand on which the SNP is located, the start and end of the exon which derived from the transposed element, the transposed element's family, the SNP id and the alleles of the SNP and the position at which the SNP is located (always seen from the exon, that is, 1st position of acceptor indicates the base which is located nearest to the splice site). Found at: doi:10.1371/journal.pone.0010907.s001 (0.05 MB DOC)

Table S2 Population frequency data for the SNPs which changed a non-canonical splice site into a canonical one while the other splice site was already canonical. Given is the SNP id along with the alleles and the position where this SNP occurred as well as the frequency data. Here, the homozygosity for the first allele, the heterozygosity, the homozygosity for the second allele, the Hardy-Weinberg proportions as well as the frequencies for each of the alleles are given. CEPH-European, HISP-Hispanic, AD-African American, CEU-European, HCB-Asian, JPT-Asian, YRI-Sub-Saharan African, HWP-Hardy-Weinberg proportions. Found at: doi:10.1371/journal.pone.0010907.s002 (0.10 MB DOC)

Author Contributions