The Role of Recombination in the Origin and Evolution of Alu Subfamilies

Alus are the most abundant and successful short interspersed nuclear elements found in primate genomes. In humans, they represent about 10% of the genome, although few are retrotransposition-competent, and they are clustered into subfamilies according to the source gene from which they evolved. Recombination between them can lead to genomic rearrangements of clinical and evolutionary significance. In this study, we have addressed the role of recombination in the origin of chimeric Alu source genes by the analysis of all known consensus sequences of human Alus. From the allelic diversity of Alu consensus sequences, validated in extant elements resulting from whole genome searches, distinct events of recombination were detected in the origin of particular subfamilies of AluS and AluY source genes. These results indicate that at least two subfamilies are likely to have emerged from ectopic Alu-Alu recombination, which should stimulate further research into the potential of chimeric active Alus to punctuate the genome.


Introduction
Alus are the most abundant and successful Short Interspersed Nuclear Elements (SINEs) and are exclusively found in primate genomes. In humans, they represent nearly 10% of the nuclear genome, corresponding to over 1 million copies and a frequency of one insertion per 3 kb [1,2]. An Alu is about 300 bp long and is composed of two monomers that originated from the 7SL RNA gene [2], punctuated by several CpG doublets and joined to one another by a poly-A stretch. A second poly-A tail is present at the 3′ end. Active Alus, also known as source or master genes, are those able to generate progeny by reverse transcription of an Alu RNA molecule that is inserted in novel genomic locations [3,4]. The Alu retrotransposition rate in humans was estimated at 1 in 21 births [5], a significant contribution of these elements to human diversity. Most Alus in the genome are inactive, as retrotransposition ability is often impaired by truncation of 5′ bases, shortening of the poly-A tail, or other mutations that occur during, or sometimes after, genome integration [6].
Alus were first classified into distinct subfamilies that share specific (diagnostic) positions [20]. However, since events of substitution, back mutation and recombination [21] are frequent, this criterion was later changed to define a subfamily as a collection of Alus that originated from the same source gene [22], although multiple source genes can contribute to the same subfamily [23,24].
Previous studies have documented cases of Alu chimerism [7] as a source of intra-subfamily variability [25], Alu re-activation [26], and as a source for the emergence of new subfamilies in New World monkeys [27]. In line with this, we posed the pertinent question: did any human Alu subfamily emerge from a chimeric source gene resulting from Alu-Alu recombination? This work aims at searching for such chimeric elements in humans. We started by creating a database of subfamily polymorphic sites. Focusing mainly on insertion/deletion (indel) markers and motivated by findings in whole genome searches that established the presence of both insertion and deletion alleles in the extant genome, we were able to detect two cases of recombination: (a) AluSx4 and (b) one cluster of subfamilies that includes either AluYe2, AluYe4, or AluYe5, AluYe6 and AluYf5. Our work establishes that Alu-Alu recombination offers the genome new elements which are free to retrotranspose, evolve and play their role in the emergence of phenotypic novelties.

Collection of Human Consensus Alus
Sequences corresponding to consensus Alus were retrieved from the Repbase Update [28] database and from previous work [22,29,30]. Manual inspection of all sequences revealed cases with more than a single consensus sequence documented for a specific subfamily, as is the case of the AluYa1 subfamily. To avoid an arbitrary choice of the sequence representing the subfamily, all sequences were included in the analyses and distinguished as, for instance, AluYa1_1 and AluYa1_2. The collection of human Alu consensus sequences is provided in Text S1.

Database of Polymorphic Sites in Consensus Alus
The collection of consensus Alus was aligned in Geneious v5.4 [31] and poly-A tails were not considered (Text S2). The ancestral AluJo was set as the reference sequence in our analyses and, consequently, position numbering was performed according to the AluJo consensus sequence (Figure 1). Insertion and deletion polymorphisms (indels) were named as in the following example: a single-base deletion in position 65 is indicated as "65 del" and an insertion of an adenine after position 177 is indicated as "177.1 ins" (AluJo). Consensus Alus and polymorphic sites were then entered into a database that provides all the information regarding the position and the distinct allelic forms of each polymorphism present in human consensus sequences. The database of Alu variability is accessible in Dataset S1.
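As an illustration of the naming convention described above, the following minimal Python sketch parses labels such as "65 del", "177.1 ins" or "87-98 del" into their components. The names `IndelLabel` and `parse_indel_label` are hypothetical and are not part of the original analysis; the accepted label grammar is an assumption based only on the examples given in the text.

```python
import re
from typing import NamedTuple, Optional


class IndelLabel(NamedTuple):
    start: int              # first AluJo position involved
    end: int                # last AluJo position involved (== start for single sites)
    offset: Optional[int]   # n in "177.n": base inserted after `start`, absent from AluJo
    kind: str               # "ins" or "del"


# Assumed grammar, covering "65 del", "177.1 ins", "65-66 ins" and "87-98 del".
_LABEL = re.compile(r"^(\d+)(?:\.(\d+))?(?:-(\d+))?\s+(ins|del)$")


def parse_indel_label(label: str) -> IndelLabel:
    """Split an indel label into position range, insertion offset and allele type."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognised indel label: {label!r}")
    start, offset, end, kind = match.groups()
    return IndelLabel(
        start=int(start),
        end=int(end) if end else int(start),
        offset=int(offset) if offset else None,
        kind=kind,
    )


if __name__ == "__main__":
    for text in ("65 del", "177.1 ins", "87-98 del"):
        print(text, "->", parse_indel_label(text))
```

Keeping the insertion offset separate from the reference position preserves the AluJo numbering for all sequences, which is the purpose of the ".1" notation.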

Whole Genome Search of Alu Indels
The presence of Alu indels in the extant human genome sequences was assessed using a Python script. The BioPython toolkit [32] was used to blast the NCBI human genome reference sequence (November 2012, Human Annotation Release 104) using an e-value threshold of 10^-25 and allowing no gaps between the query and the subject sequence, in order to prevent cross-contamination of each list with the counterpart allelic form. A total of 23 sequences (Table S1) were used as queries in the blast search. These sequences correspond to the 20 allelic forms of simple indels and a more complex pattern displayed in positions 65 and 66 with three allelic forms (65-66 ins: YT; 65 del: -T; 65-66 del: -). Each of these sequences was retrieved from a consensus Alu carrying the target allele (e.g., the 87-98 ins allele is represented by the AluY sequence whereas its counterpart, the 87-98 del allele, is represented by AluYc5). The retrieved sequence hits were saved in FASTA files and aligned in Geneious v5.4. The results were assembled in Excel format (Dataset S2), which holds a total of 144398 hits.
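The two filtering criteria described above (an e-value cutoff and no gaps between query and subject) can be sketched as follows. The original script is not shown in the text, so this is only an assumption-laden illustration: it supposes that hits were exported in the standard 12-column BLAST tabular format, takes the cutoff as 1e-25, and uses hypothetical query and subject identifiers.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(tabular_text, max_evalue=1e-25):
    """Keep hits at or below the e-value cutoff that contain no gap openings."""
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    kept = []
    for row in reader:
        # A gapped hit could match the counterpart allele of the same indel,
        # so only ungapped alignments are retained.
        if float(row["evalue"]) <= max_evalue and int(row["gapopen"]) == 0:
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Hypothetical hits: only the first passes both criteria.
    example = "\n".join([
        "\t".join(["q_87-98_ins", "chrA", "100.0", "60", "0", "0", "1", "60", "500", "559", "2e-27", "111"]),
        "\t".join(["q_87-98_ins", "chrB", "96.0", "60", "1", "1", "1", "60", "900", "958", "3e-26", "100"]),
        "\t".join(["q_87-98_ins", "chrC", "93.0", "60", "4", "0", "1", "60", "100", "159", "1e-18", "80"]),
    ])
    for hit in filter_hits(example):
        print(hit["sseqid"], hit["evalue"])
```

Filtering on the gap-opening column is one simple way to enforce the no-gap requirement after the search; the same constraint could instead be imposed when the search is run.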

Network Calculation using Indels
The Network 4.6.1.1 software (http://www.fluxus-engineering.com/sharenet.htm) was used to cluster the entire collection of Alu sequences represented in the database. Allelic forms were converted into binary data (presence/absence) in the input file and only indel markers were used. Polymorphisms in the poly-A linker and tail were not included. All mutation sites were given an equal weight of 10. All networks were calculated using the reduced median (RM) algorithm with the default parameters.

Figure 1. Sequence alignment of at least one representative of each haplotype defined by the 11 indel markers; node 1 is represented by two sequences: AluJo and AluSx. Position numbering was performed according to the reference AluJo. The first base of each indel is also indicated (red). Poly-A linker polymorphisms were disregarded. Dots represent identical bases and hyphens represent gaps (absent or deleted bases). R represents bases A or G according to the IUPAC code for nucleotide ambiguities. doi:10.1371/journal.pone.0064884.g001
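The presence/absence coding of indel alleles mentioned in this section can be sketched as below. The exact layout of the Network input file is not described in the text, so this is only one possible reading, in which every distinct allele becomes one binary character; the function name `encode_binary` and the sequence names are hypothetical.

```python
def encode_binary(haplotypes):
    """Return (ordered allele names, {sequence name: '0101...' string}).

    `haplotypes` maps a sequence name to the set of indel alleles it carries.
    """
    # One column per distinct allele, so a site with three allelic forms
    # (such as positions 65-66) simply occupies three columns.
    alleles = sorted({a for carried in haplotypes.values() for a in carried})
    rows = {
        name: "".join("1" if a in carried else "0" for a in alleles)
        for name, carried in haplotypes.items()
    }
    return alleles, rows


if __name__ == "__main__":
    # Hypothetical sequences carrying alleles named as in the text.
    toy = {
        "seqA": {"65-66 ins"},
        "seqB": {"65 del", "265.1 ins"},
    }
    columns, coded = encode_binary(toy)
    print(columns)
    for name, bits in coded.items():
        print(name, bits)
```

Sequences with identical binary strings collapse into the same node, which is how several subfamilies can share one haplotype in a network built from indels alone.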

Database of Polymorphic Sites for Consensus Alus
Most Alu copies inserted in the genome are inactive. This is especially evident in older subfamilies that no longer have active source genes due to a gradual accumulation of mutations. Analyses were performed using Alu consensus sequences, since it is important to consider the original sequence of each subfamily source gene. A consensus sequence is, by definition, a sequence that represents the very first source gene of a subfamily [33].
The collection of Alu consensus sequences retrieved from databases and related literature includes a total of 86 unique consensus sequences matching 73 distinct subfamilies. Of these, four correspond to the ancestral AluJ, 20 are documented as AluS sequences and 49 as AluY, the youngest family member [34]. Sequences were aligned for further comparison and AluJo was set as reference (Text S2). Position numbering was performed accordingly (Figure 1). This analysis revealed a total of 144 polymorphic positions (132 SNPs and 11 indels) that were combined into a database (Dataset S1) of Alu polymorphic variation. More than two alleles exist in 17 out of the 132 SNPs detected among Alu sequences, and in a single case (position 262 of AluJo) all four alleles were observed. The polymorphic indels range in length from 1 to 19 bp (Dataset S1) and, with the exception of positions 65 and 66, no size heterogeneity within the inserted/deleted sequence was observed.
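A rough sketch of how variable positions might be listed from an alignment of consensus sequences is given below. It is not the procedure used to build Dataset S1: column indices here are plain alignment columns rather than AluJo positions, multi-base indels are not merged into single events, and ambiguity codes such as R are treated as ordinary symbols. All names and sequences are hypothetical.

```python
def polymorphic_columns(aligned):
    """Map 1-based alignment column -> set of symbols, for variable columns only."""
    length = len(next(iter(aligned.values())))
    if any(len(seq) != length for seq in aligned.values()):
        raise ValueError("sequences must be aligned to the same length")
    variable = {}
    for i in range(length):
        symbols = {seq[i].upper() for seq in aligned.values()}
        if len(symbols) > 1:
            variable[i + 1] = symbols
    return variable


def classify(symbols):
    """Call a column an indel if any sequence has a gap there, otherwise a SNP."""
    return "indel" if "-" in symbols else "SNP"


if __name__ == "__main__":
    toy = {"ref": "ACGT-A", "s1": "ACGTTA", "s2": "ATGT-A"}
    for column, symbols in polymorphic_columns(toy).items():
        print(column, classify(symbols), sorted(symbols))
```

Counting the distinct non-gap symbols per SNP column would likewise identify sites with more than two alleles.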

Whole Genome Search of Alu Indels
Assuming that each consensus sequence evolved from preexistent elements by mutation accumulation, we reconstructed the phylogenetic relationship between human Alu subfamilies using the available polymorphic information. Because many SNPs involve CpG dinucleotides and very few are subfamily-specific (see Dataset S1), we exploited the informativeness of indels discovered in the record of consensus sequences (Figure 1 and Dataset S1) to trace Alu lineages that date back 65 Myr [35]. To exclude the possibility that these insertions/deletions resulted from errors or gaps during sequence reconstruction, a whole genome search was performed in the human reference sequence, as well as in NCBI nucleotide genome sequences, using insertion and deletion alleles as queries. Examples of resulting sequence hits for each allele are presented in Figure 2 (more detailed information is given in Dataset S2). No general conclusions can be drawn regarding allele frequencies, as this strategy was intended to identify highly similar sequences and discards those that have accumulated enough mutations over time to fall below the limits of detection.

Evolutionary Clustering of Active Human Alus
Once it was established that indel markers are not artifacts of sequence alignment at the time of consensus prediction, we used the haplotypic combination of indels to demonstrate the evolutionary relationships between Alu elements (Figure 3). As a result of size heterogeneity in positions 65 and 66 (65-66 ins, 65-66 del and 65 del), located in the left monomer, two networks were constructed: one assuming that the three combinations arose consecutively (65-66 ins → 65 del → 66 del) (Figure 3A) and the other assuming that they were independent events (65-66 ins → 65 del and 65-66 ins → 65-66 del) (Figure 3B). Both analyses produced similar graphs, an indication that the origin of the mutational events does not alter the clustering inference.
With the exception of two reticulations that clearly show alternative solutions to mutational events, both networks are well resolved, revealing that most active genes originated from preexisting sequences by mutation. The two reticulations, which link nodes 1, 2, 3, 4 and nodes 7, 13, 14, 15, may allude to events of Alu-Alu recombination, and this hypothesis was further explored. In one of the cases (Figure 3, left reticulation), the Alu subfamilies represented in nodes 1, 2, 3 and 4 are distinguished by the haplotypic combination of the 65-66 and 265.1 polymorphisms (Figure 4). Because positions 65 and 66 are deleted in the youngest AluY subfamily, and present in the old AluJo, the ancestral allele is 65-66 ins (Figure 3, node 1) [36]. Following the same rationale, 265.1 ins is likely to be the youngest allele. Therefore, several alternative pathways were considered (Figure 4) based on the order of mutational events occurring in each monomer.
The most likely explanation for the emergence of these haplotypes is recombination (Figure 4, A and B). Path A illustrates the emergence of AluSx4 by recombination between an AluSq4 and an Alu lacking the 265.1 insertion, whereas path B shows the emergence of the subfamilies on node 2 (AluSp, AluSq, AluSq2, AluSq3 and AluSq10) by recombination of an AluSq4 and an Alu from node 1. In-depth analyses of the consensus sequences of the subfamilies involved made it possible to discern the most parsimonious hypothesis: path A. AluSx4 differs from AluSq4 by the T98C substitution in the left monomer (Figure 4, alignment), and values of pairwise identity among the right monomers of all possible donor candidates (those not carrying the 265.1 ins) revealed that the most likely contributor was AluSx3, since the two differ at a single site (G191A) (Figure 4) and share 99.3% sequence identity. Both SNPs 98C and 191A are specific to AluSx4. Pathway B is less likely, as it would require a minimum of ten extra mutational steps subsequent to the putative recombination between AluSq4 and elements of node 1. Although both pathways involve a recombination event, the one that requires fewer mutational steps is pathway A, which points to the origin of the AluSx4 subfamily through recombination between an AluSq4 and an AluSx3 (Figures 3 and 4). If this is the case, the most likely representation of Alu evolution is shown in Figure 3B, that is, the deletion at positions 65-66 originated from the ancestral 65-66 ins allele.
Indels have a very low mutation rate, less than one tenth of the SNP mutation rate [37]. Although less likely, events of indel back mutation (Figure 4, C and D) or recurrent mutation (Figure 4, E and F) are also possible explanations for the emergence of the observed haplotypes. Concerning back mutation events, path C describes the emergence of AluSx4 by the deletion of base 265.1 in an AluSq4. In path D, the subfamilies of node 2 emerge from an AluSq4 by the re-insertion of a C in position 65. These paths are characterized by a succession of mutations in which the re-emergence of the ancestral allele is possible, although extremely unlikely for an indel marker. Events of recurrent mutation are equally possible and equally unlikely. Path E illustrates the independent origin of AluSx4 and the elements of node 2, followed by the origin of AluSq4 through a deletion of base 65, while path F shows the independent insertion of base 265.1.
The network reticulation on the right (Figure 3) has an even higher number of possible explanations for the appearance of the observed haplotypes (Figure 5). In this case, the key positions to establish the alternative mutational pathways are 206.1 and 266-267.
Three pathways (Figure 5 A, B and C) imply an event of recombination. Regarding path A, assuming that AluYe4 and AluYe2 resulted from mutations in distinct lineages (206.1 ins and 266-267 del, respectively) of an ancestral sequence in node 7, and that a recombination event occurred between them, the ancestor of the subfamilies in node 14 (AluYe5, AluYe6 and AluYf5) was a recombinant Alu. With respect to pathway B, AluYe4 is a recombinant composed of a 5′ part from a member of node 14 (AluYe5, AluYe6 or AluYf5) and a 3′ part from an element with the 266-267 ins allele from a member of nodes 7, 10 or 11. Lastly, pathway C describes AluYe2 as a recombinant between one of the elements from node 14 (AluYe5, AluYe6 or AluYf5) and an element from node 7.
As with the previous example, the allelic configuration of these elements was analysed and combined with information provided by pairwise identity scores between the involved elements. These analyses did not reveal the most parsimonious hypothesis, as the identity scores between recombinant (chimeric) Alus and their corresponding parental elements reached about 100% in all cases, which is the result of the recent origin of the AluY subfamily [34]. Events of back and recurrent mutation (Figure S1) could also explain the existence of the haplotypes of these subfamilies; however, due to the recent advent (20 Mya) of the AluY clade [38], these hypotheses are even less likely. Back and recurrent mutations are even rarer when considering indels longer than 1 bp, which is the case of indel 266-267. Furthermore, since the allele 206.1 del seems to be strongly associated with two SNPs (211A and 220T, Figure 1), events of back mutation would also have to occur in those two sites to result in the haplotypic combination observed in these subfamilies, which reinforces the unlikeliness of these events.

Figure 4. An alignment of at least one representative of each involved node is displayed, plus two representatives of node 7 (AluY and AluSx3). Alternative pathways are named A to F. A and B represent recombination events (green), C and D represent events of back mutation (orange) and E and F represent recurrent mutations (blue). doi:10.1371/journal.pone.0064884.g004
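The pairwise identity comparison used above to rank candidate donors can be sketched as follows. This is a minimal illustration and not the calculation performed in Geneious, which may treat gaps differently; here, columns that are gaps in both sequences are skipped, and a gap facing a base counts as a difference. The function name and the sequences are hypothetical.

```python
def pairwise_identity(a, b):
    """Percent identity between two aligned sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" and y == "-":
            continue  # column empty in both sequences, not informative
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy aligned fragments differing at one site out of ten.
    print(pairwise_identity("ACGTACGTAC", "ACGTACGTAT"))
```

For scale, a single mismatch over 145 compared positions gives 99.3% when rounded to one decimal place; the value 145 is an arbitrary illustrative length and not a figure taken from the study.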
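The reasoning about the left reticulation can be illustrated with a four-gamete check, a standard test that is swapped in here for illustration and is not named in the original analysis. When two biallelic markers show all four allele combinations, the pattern cannot be explained by a single mutation at each site and requires recombination, back mutation or recurrent mutation. The haplotype states below are read from the pathway descriptions above and simplify the position 65-66 polymorphism to two states, so they should be treated as assumptions.

```python
from itertools import combinations


def incompatible_pairs(haplotypes):
    """Return index pairs of binary markers for which all four combinations occur."""
    n_markers = len(next(iter(haplotypes.values())))
    flagged = []
    for i, j in combinations(range(n_markers), 2):
        observed = {(states[i], states[j]) for states in haplotypes.values()}
        if len(observed) == 4:  # 00, 01, 10 and 11 are all present
            flagged.append((i, j))
    return flagged


if __name__ == "__main__":
    # Marker 0: base 65 deleted (1) or present (0).
    # Marker 1: 265.1 insertion present (1) or absent (0).
    # Assumed states, inferred from the pathways described in the text.
    toy = {
        "node 1 (AluJo, AluSx)": (0, 0),
        "node 2 (AluSp, AluSq, ...)": (0, 1),
        "AluSq4": (1, 1),
        "AluSx4": (1, 0),
    }
    print(incompatible_pairs(toy))
```

The check only flags the conflict; choosing among recombination, back mutation and recurrent mutation still depends on the parsimony and mutation-rate arguments given in the text.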

Alu Master Genes can Originate through Recombination
Events of ectopic recombination among Alu elements are known to be associated with deleterious rearrangements [9-11,13,14,39] and with Alu chimerization [22,25,26,40], for instance in elements reactivated by partial gene conversion involving the poly-A tail at the 3′ end [26]. In this study, we searched for signs of recombination in known consensus sequences that represent the original source gene of each subfamily. Although predicted based on sequence homology, each consensus Alu must carry all the elements necessary for the retrotransposition process. Previous work tested 13 consensus Alus (AluJo, AluSx, AluY, AluYa5, AluYa5a2, AluYb8, AluYc1, AluYd8, AluYe5, AluYf2, AluYg6, AluYi6, AluYj4) and showed that all of them are able to retrotranspose, including the ancient AluJo [29].
We started by collecting all known consensus Alu sequences in humans and compiled them in a database that includes 86 sequences from 73 subfamilies and a total of 144 polymorphic positions (Dataset S1). Among the polymorphisms found, 11 are indels and were used to establish the historical relationship between the distinct subfamilies. The graphical clustering of all 73 Alu subfamilies revealed two distinct reticulations (Figure 3) that were analysed to evaluate all possible mutational and/or recombination events. After considering all possible pathways we could establish the role of Alu-Alu recombination in the origin of chimeric master genes, though it is not clear whether the underlying mechanism was crossover or gene conversion. Our uncertainty in distinguishing between crossover and gene conversion is due to the lack of information on the flanking genomic region of the original master genes. Although gene conversion has been assumed to be more frequent than crossover in Alu recombination [41-43], direct proof of gene conversion would only be possible if both recombination products were available [44].

The Family Tree of Human Alus based on Polymorphic Information
A general analysis of the information provided by both indels and SNPs allowed the distinction of Alu subfamilies according to informative positions (Figure 6). Despite the information provided by the combination of both marker types, large clusters incorporating a vast number of subfamilies, mainly among young AluY elements, are still observed. It is important to mention that although a high number of polymorphic positions were detected among Alu consensus sequences (Dataset S1), only A120T, G194A, T214C, C215G and G219C represent single occurrences in the history of Alu sequences that can be used as diagnostic positions (Figure 6).
The data presented in Figure 6 are relevant in many other respects. There are cases in which subfamilies with more than one consensus sequence were clustered in distinct nodes, having different haplotypic combinations, as is the case of AluYb6. Such cases reveal that the boundaries that delimit a subfamily are unclear. So, the questions we put forward are: (a) by how many mutational steps can a source gene differ from its parental gene and still be considered a subfamily member and, the other way around, (b) how many mutations are necessary for an Alu sequence to be considered the founder of a new subfamily? Although we were able to detect two cases of recombination, our approach may have failed to detect additional cases of subfamilies that emerged by the same process. More data are needed in order to evaluate the complex role of ectopic recombination in the birth of chimeric Alu elements with retrotransposition ability, thus increasing genomic variability, creating new Alu insertions, and promoting further non-allelic homologous recombination.

Supporting Information
Figure S1 Additional alternative pathways for the origin of Alu subfamilies clustered in nodes 13, 14 and 15 of Figure 3. Alternative pathways are named A to G. A, B and C represent recombination events (green), D and E represent events of back mutation (orange) and F and G represent recurrent mutations (blue). (TIF)

Text S1 List of human Alu consensus sequences. (DOCX)

Text S2 Complete alignment of human Alu consensus sequences. (TXT)

Dataset S1 Database of all polymorphic positions detected in the complete list of human consensus Alus. Position numbering was performed according to AluJo. Major subfamily-specific mutations are coloured blue (sites 120, 194, 214 and 215) and green (site 219) and are specific to AluJ and AluY, respectively. Other subfamily-specific mutations are coloured grey. (XLSX)

Dataset S2 Human sequences that match each indel allele retrieved from whole genome searches.