Oscillating Evolution of a Mammalian Locus with Overlapping Reading Frames: An XLαs/ALEX Relay

XLαs and ALEX are structurally unrelated mammalian proteins translated from alternative overlapping reading frames of a single transcript. Not only are they encoded by the same locus, but a specific XLαs/ALEX interaction is essential for G-protein signaling in neuroendocrine cells. A disruption of this interaction leads to abnormal human phenotypes, including mental retardation and growth deficiency. The region of overlap between the two reading frames evolves at a remarkable speed: the divergence between human and mouse ALEX polypeptides makes them virtually unalignable. To trace the evolution of this puzzling locus, we sequenced it in apes, Old World monkeys, and a New World monkey. We show that the overlap between the two reading frames and the physical interaction between the two proteins force the locus to evolve in an unprecedented way. Namely, to maintain two overlapping protein-coding regions the locus is forced to have high GC content, which significantly elevates its intrinsic evolutionary rate. However, the two encoded proteins cannot afford to change too quickly relative to each other as this may impair their interaction and lead to severe physiological consequences. As a result XLαs and ALEX evolve in an oscillating fashion constantly balancing the rates of amino acid replacements. This is the first example of a rapidly evolving locus encoding interacting proteins via overlapping reading frames, with a possible link to the origin of species-specific neurological differences.


Introduction
The GNAS1 locus encodes the stimulatory G-protein subunit α, a key element of the classical signal transduction pathway linking receptor-ligand interactions with the activation of adenylyl cyclase and a variety of cellular responses [1][2][3]. The gene is subject to complex imprinting, producing a spectrum of maternally, paternally, and biallelically derived transcripts [4]. The major paternally imprinted transcript of the gene is expressed primarily in neuroendocrine tissues and includes an unusually large upstream exon (the XL-exon) comprising over 50% of the protein-coding region. The XL-exon contains two completely overlapping reading frames in the same orientation but shifted one nucleotide relative to each other, so that codon positions 1, 2, and 3 of the first frame overlap with positions 3, 1, and 2 of the second frame. In humans the first frame of the exon encodes the 388 N-terminal amino acids of a 736-residue extra large form of Gα (XLαs) [5][6][7][8][9][10]. The second frame encodes all 322 amino acids of the alternative gene product encoded by the XL-exon (ALEX) and terminates exactly at the end of the exon. The internal section of the XL-exon contains imperfect repeated units of variable length translated into amino acid repeats averaging 13 residues in both XLαs and ALEX [4]. The repeat number varies in a studied human population (n = 276), with the majority carrying a 13-unit allele, while an insertion of an additional unit (the 14-unit allele) is found in 2.2% of surveyed individuals [11]. Heterozygous individuals with a maternally inherited 14-unit allele and 13-unit homozygotes are normal. Conversely, carriers of a paternally inherited 14-unit allele exhibit hyperactivity of the G-protein pathway and suffer from a variety of pathological conditions such as mental retardation, brachymetacarpia, hypertrichosis, hypotonia, growth deficiency, or prolonged trauma-induced bleeding [12].
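The correspondence between codon positions in the two frames follows directly from the one-nucleotide shift. A minimal Python sketch of that arithmetic is given here for illustration; the function names are hypothetical and are not taken from the authors' scripts.

```python
def codon_position(index: int, frame_start: int) -> int:
    """Return the 1-based codon position (1, 2, or 3) of a nucleotide.

    `index` is the 0-based nucleotide coordinate and `frame_start` is the
    0-based coordinate of the first base of the reading frame.
    """
    return (index - frame_start) % 3 + 1


def frame_correspondence(shift: int = 1) -> dict:
    """Map codon positions of the first frame to those of a frame shifted by `shift`."""
    return {codon_position(i, 0): codon_position(i, shift) for i in range(3)}


print(frame_correspondence())  # {1: 3, 2: 1, 3: 2}
```

With a shift of one nucleotide, positions 1, 2, and 3 of the first frame map to positions 3, 1, and 2 of the second, as stated above.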
Binding assays showed a decreased affinity between XLαs and ALEX in individuals carrying the 14-unit allele, which leads to an elevated concentration of free XLαs (unbound to ALEX) capable of activating adenylyl cyclase [12]. As a result, the intracellular cAMP concentration rises to over 600% of the normal level. Thus, ALEX regulates the intracellular cAMP level by specifically binding XLαs and preventing it from interacting with the receptors and adenylyl cyclase [12,13]. Loss-of-function mutations involving XLαs also lead to severe adverse effects. Mice lacking XLαs expression show poor postnatal development, with the majority dying within 48 h of birth [5].
The functional importance of XLαs and ALEX suggested by these examples implies that this locus should be under considerable selective constraint. Yet the XL-exon evolves at a remarkable pace: the nucleotide identity between the human and mouse XL-exons is only 71%, and the amino acid identity between human and mouse ALEX is 53% [13]. For comparison, the average nucleotide and amino acid identities between human and mouse protein-coding genes and their protein products are 86% and 89%, respectively [14]. Why would a locus encoding two essential signaling proteins evolve so rapidly?

Synopsis
One of the possible ways to achieve tight co-expression of two proteins is to encode them within a single mRNA. The GNAS1 gene in mammals does just that: it encodes two interacting signaling polypeptides within a single transcript using nested reading frames shifted one nucleotide relative to each other. The exceptionally high GC content of the region where the two reading frames overlap diminishes the probability of encountering stop codons but makes the locus highly mutable. To preserve their ability to interact functionally with each other despite the high mutation rate, the two polypeptides appear to evolve in an oscillating fashion, trying to maintain approximately equal rates of amino acid substitutions. This unexpected observation provides new insights into the evolution of mostly overlooked overlapping coding regions in eukaryotic genomes.

Results/Discussion
To take a closer look at the evolutionary dynamics of XLαs and ALEX, we sequenced the XL-exon from eight primates and immediately found striking differences within the repeat-containing region (we used XL-exon boundaries as described in Hayward et al. [4]; also see Methods). All studied species, namely human, apes (chimpanzee, gorilla, orangutan, and gibbon), Old World monkeys (colobus and macaque), and a New World monkey (squirrel monkey), varied in the number and/or sequence of repeated units (Figure 1). Human had the smallest number of repeats, while the remaining taxa contained at least one additional repeat unit between positions I and N (Figure 1), a region where an insertion in humans is linked to disease. Taxa closest to human (chimpanzee, gorilla, and orangutan) carried the largest number of repeat units and an additional insertion at position B. Gibbon, colobus, macaque, and squirrel monkey contained an additional insertion at position H. Assuming that the sequenced alleles are fixed in the respective primate populations, the XL-exon experienced an episode of repeat expansion in the great ape lineage followed by a dramatic repeat loss on the branch leading from the human/chimpanzee ancestor to modern humans (Figure 2). Note that in all sampled species both reading frames remain intact regardless of the insertion/deletion events. The observed pattern may have implications for the evolution of species-specific neurological and metabolic differences (discussed below), since the variation in the number of repeats has profound developmental and physiological effects [5,11,12].
Next, we analyzed the pattern of nucleotide substitutions within the XL-exon (excluding the repeat-containing region) and observed a striking oscillation of amino acid replacement rates between XLαs and ALEX. The interaction between the two proteins imposes a unique constraint: if one protein changes, the other needs to rapidly ''evolve'' a compensatory substitution to preserve the mutual affinity. Although this cannot be observed directly in our data because such changes are likely to occur within each lineage in rapid succession, the overall effect of this process should be similar rates of amino acid replacements in the two proteins. To test this hypothesis we compared nucleotide substitutions between the XLαs and ALEX frames in the sequenced species. Classical measures of nucleotide substitution rates such as K_S and K_A [15] are not directly applicable here because of the interdependence of the two overlapping frames [16][17][18]. However, these measures can be used in a relative context. Specifically, the ratio of nonsynonymous rates between the two frames, K_A(XL)/K_A(ALEX), can be used to test the equality of amino acid replacement rates between the two proteins. To carry out this analysis we reconstructed a phylogenetic tree using unambiguously aligning portions of the XL-exon. For every branch of the tree we computed the K_A(XL)/K_A(ALEX) ratio using maximum likelihood estimates of nonsynonymous rates for each frame (Figure 2). Ratios vary considerably among branches. For example, branches originating from node 3 (3→Pp and 3→2) show opposing K_A(XL)/K_A(ALEX) ratios. However, none of the ratios is significantly different from 1 (p-values from Fisher's exact test are between 0.14 and 0.77), supporting our hypothesis that the two proteins constantly co-evolve and maintain a K_A(XL)/K_A(ALEX) ratio of approximately 1.
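The text does not specify how the contingency tables for Fisher's exact test were built. As an illustration only, the sketch below assumes a 2×2 table that contrasts, for each frame, the number of nonsynonymous changes with the number of unchanged nonsynonymous sites. That table layout and the counts in the usage lines are assumptions, not values from this study. The test itself is implemented with the standard library so that the example is self-contained.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    The p-value is the total hypergeometric probability of all tables with
    the same margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability of a table whose top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # A small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical counts for one branch (NOT data from this study):
# rows are the XLαs and ALEX frames, columns are changed / unchanged sites.
p_value = fisher_exact_two_sided(12, 788, 9, 771)
print(f"p = {p_value:.3f}")
```

A p-value above the chosen threshold would mean that the two frames cannot be shown to differ in their rate of amino acid replacement on that branch.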
A possible caveat of this analysis is the use of internal nodes, because the likelihood method we employed to estimate branch-specific rates was not intended to handle coding sequences with multiple reading frames. To address this, we estimated pairwise K_A between the XLαs and ALEX reading frames. For this purpose we developed a neighbor-dependent modification of the Nei-Gojobori (NG) method [19]. Unlike the classical NG, our method estimates the number of synonymous and nonsynonymous changes in a given frame (e.g., XLαs) without considering any pathways that would create nonsense codons in the other frame (e.g., ALEX). The resulting estimates were only slightly different from those of the NG, Yang-Nielsen [20], and likelihood [21] methods, as the high GC content of the XL-exon (68% in human) decreases the chance of encountering pathways that contain nonsense codons (Table 1). We used the new K_A estimates to calculate the K_A(XL)/K_A(ALEX) ratio for each pair of species. Again, although the ratios varied substantially, none was significantly different from 1 (at the 1% level; Table 1). The observed oscillation of the K_A(XL)/K_A(ALEX) ratio around 1 likely implies constant adjustment between the two proteins aimed at maintaining mutual affinity.
The phenomenon of oscillation is also confirmed by the pattern of nucleotide substitutions at different codon positions. Third codon positions of the XLαs frame, where most changes are synonymous, correspond to second codon positions of the ALEX frame, which are 0-fold degenerate. We modified the original parsimony algorithm by omitting ancestral states that may create stop codons in either of the two frames. Although ancestral sequences reconstructed using parsimony cannot be used as observed data [22], this analysis once again shows evolutionary fluctuation between the two frames (Figure 2). For example, the majority of substitutions on branches leading to Ca, Mm, and Sb are in the third codon position of the XLαs frame (corresponding to the 0-fold degenerate second codon position of the ALEX frame). This is also the case for the branch leading to Pp, while other branches within the human/ape clade show the opposite pattern: most substitutions are in the mostly 0-fold degenerate first and second codon positions of the XLαs frame. In addition, there are examples of recurrent substitutions leading to the same amino acids in different lineages (Table 2), suggesting that multiple optimal variants of the two proteins are allowed.
The high GC content of the XL-exon (ranging from 68% in human to 71% in squirrel monkey) is ''the blessing and the curse'' of the locus: it appears to be required for the maintenance of the two reading frames, but inevitably leads to a high substitution rate. A consequence of the high GC content is the abundance of GC-rich codons in the XLαs and ALEX frames. For instance, the most abundant codons in the XLαs and ALEX frames are GCC (10.6%) and CCG (8.9%), respectively (Figure 3). For comparison, average frequencies of these codons in humans (estimated from RefSeq genes) are 2.8% and 0.7%, respectively. The GC content may be driven up by selection acting against mutations to A and T, as these can lead to the formation of stop codons (TAA, TAG, TGA) in either of the two frames. To test this hypothesis, we simulated the eight sequences in our dataset using three different codon frequency tables compiled from (1) all human RefSeq genes, (2) the XLαs reading frame, and (3) the ALEX reading frame. All other parameters (phylogenetic tree, branch lengths, transition/transversion ratio, codon number, and the K_A/K_S ratio as estimated from the original dataset) were fixed, and each simulation was performed 1,000,000 times. Each set of simulated sequences was examined for the presence of alternative reading frames. For example, for every set of sequences simulated using XLαs codon frequencies, we looked for the presence of an alternative reading frame in the +1 phase. None of the sets from the first simulation (RefSeq codon frequencies) contained such frames, whereas approximately 1% of sets in each of the second (XLαs codon frequencies) and the third (ALEX codon frequencies) simulations contained alternative frames in the +1 and −1 positions, respectively. Thus the high GC content allows for overlapping reading frames.
The high GC content also leads to an excess of CpG dinucleotides, which occupy approximately 20% of the XL-exon (108-119 CpG sites or 18%-21% of the sequence length, depending on the species). This is significantly higher than in the majority of primate sequences (empirical p = 0.0013): the proportion of CpG sites in human protein-coding regions from the RefSeq database has a narrow distribution with a mean of approximately 7% (99% confidence interval: [7.17%; 7.43%]). In mammals, the mutation rate at CpG dinucleotides is 10-20 times higher than at other sites [23][24][25]. As a result, although CpG sites occupy only approximately 20% of the sequence in our dataset, approximately 50% of the observed nucleotide substitutions (responsible for approximately 30% of amino acid replacements) occur at these sites (Table 3). In this analysis we do not correct for multiple substitutions because existing models cannot be used in the context of the XLαs/ALEX locus. Thus, the actual rate of evolution of the XL-exon is even higher than observed. Remarkably, the majority of potential deamination events at CpG sites (CpG → CpA and CpG → TpG transitions) do not create stop codons in either of the two reading frames. Indeed, the in silico deamination of all CpG sites (109-118 replacements, depending on the species) to either TpG or CpA created only four stop codons in the XLαs frame and none in the ALEX frame in each species. In contrast, the simulated deamination caused on average 140 and 129 amino acid changes in XLαs and in ALEX, respectively. Therefore, high GC content leads to the high intrinsic mutability of the XL-exon but allows avoidance of stop codons.
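The in silico deamination described above can be sketched in a few lines of Python. This is not the authors' PERL code. It assumes that each CpG site is mutated one at a time, to TpG and to CpA in turn, and that the two frames start at 0-based offsets 0 and 1; the function names and the toy sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def cpg_sites(seq: str) -> list:
    """Return the 0-based index of the C in every CpG dinucleotide."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def count_stops(seq: str, frame_start: int) -> int:
    """Count stop codons in the reading frame that begins at `frame_start`."""
    return sum(seq[i:i + 3] in STOPS for i in range(frame_start, len(seq) - 2, 3))


def stops_created(seq: str, frame_starts=(0, 1)) -> dict:
    """Count single-site deamination events that add a stop codon to each frame.

    Every CpG is converted to TpG and, separately, to CpA; an event is
    counted for a frame when the mutant has more stop codons in that frame
    than the original sequence.
    """
    baseline = {f: count_stops(seq, f) for f in frame_starts}
    created = {f: 0 for f in frame_starts}
    for i in cpg_sites(seq):
        for product in ("TG", "CA"):
            mutant = seq[:i] + product + seq[i + 2:]
            for f in frame_starts:
                if count_stops(mutant, f) > baseline[f]:
                    created[f] += 1
    return created


# Toy sequence (not from the XL-exon): one CpG whose C -> T change yields TGA in frame 0.
print(stops_created("GCCCGAGCC"))  # {0: 1, 1: 0}
```

Applied to a real XL-exon sequence, the same tally would show how rarely CpG deamination introduces a stop codon in either frame.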
These results suggest the following model of XLαs/ALEX evolution, which favors purifying selection acting on the two proteins. The benefit of encoding the two signal transduction proteins within the same mRNA molecule might be tight expression coupling: it guarantees that the two proteins are made at the same place and at the same time. To maintain two long, overlapping reading frames the XL-exon must contain an excess of GC-rich codons, but this also leads to an elevated frequency of mutation-prone CpG dinucleotides. Because the two proteins physically interact, they must accumulate amino acid substitutions in concert: neither can change too much relative to the other, as their mutual affinity may become adversely affected. Therefore, a nonsynonymous mutation causing a deleterious change in affinity must be quickly corrected by either a reversal or a compensatory change [26]. The high mutation rate of the XL-exon, which is due to the high frequency of CpG sites, may allow such ''corrective'' changes to occur quickly. The reversals and/or compensatory changes likely occur in rapid succession, keeping the overall ratio of nonsynonymous changes, K_A(XL)/K_A(ALEX), close to 1 for a given lineage, a phenomenon observed in our data (see Figure 2). The shortcoming of this stochastic process is that by constantly adjusting to each other, XLαs and ALEX may drift beyond the acceptable level of mutual affinity. One way to overcome this situation might be to change the number of internal repeat units, which may serve as sandbags on an air balloon, allowing rapid changes in affinity in a single step (e.g., an addition/deletion of a single repeat unit in humans causes a significant change in affinity [12]). This may explain the remarkable variation in the number of internal repeat units in humans and apes.
This simple model implies that the two proteins evolve under a purifying selection scenario and that the observed high substitution rate is a consequence of the high GC content imposed by the need to maintain two reading frames.
We cannot rule out an alternative adaptive evolution explanation of the variation in the number of repeats and the pattern of amino acid changes in XLαs and ALEX. XLαs and ALEX are predominantly expressed in neuroendocrine tissues, where they likely play a role in the development and maintenance of neurological functions [5,12,27]. In particular, XLαs expression is evident in distinct regions of the brain controlling processing of sensory information (locus coeruleus) and innervation of orofacial muscles (i.e., facial nucleus) [5]. Individuals with disrupted XLαs/ALEX interactions have multiple neurological complications, including feeding motility problems, psychomotor retardation, and disturbed behavior [12]. It is therefore plausible that amino acid replacements and the variation in the internal repeat number may have been associated with the adaptation of G-protein signaling to specific neurological functions, perhaps specific to humans. However, to reliably distinguish between the possibilities of purifying and positive selection, it is necessary to experimentally measure XLαs/ALEX affinities in primates, a direction currently pursued by our laboratories.
Is the XLαs/ALEX locus the only example of extensively overlapping reading frames in mammals? Only three additional cases are known where protein products of both reading frames were biochemically characterized. These include genes for the cyclin D-dependent kinase inhibitor INK4a [28], X-box protein 1 [29], and a region of overlap between 4E-BP3 and MASK [30]. Discovery of genes with alternative reading frames is hampered by our disbelief in their existence. For example, ALEX was discovered long after the XLαs gene had been identified [9,13]. Early results from our laboratories indicate that there are many more genes (possibly hundreds) potentially encoding multiple proteins via alternative reading frames. In each case the alternative reading frame is conserved in all known mammalian orthologs of a gene. As with XLαs, most of these genes have been known for some time, but the presence of the alternative reading frame has gone unnoticed. Biochemical characterization of these alternative products is underway and may assist us in discerning yet another facet of mammalian gene organization and evolution.

Materials and Methods
Amplification and sequencing of XL-exon. The entire XL-exon was amplified from genomic DNA in all eight species, using primers 990F and 2954R or 2428R (Table 4). These primers were designed using the published human sequence [4]. Specifically, positions 318 and 511 within the XL-exon were considered to be the starts of the XLαs and ALEX coding regions, respectively (as defined in [31] and [13]).
Data analysis. A reliable alignment was generated by first translating the nucleotide sequences from each taxon, aligning the translations using ClustalW [32], refining these alignments manually, and then reconstructing nucleotide alignments, using the protein alignment as a guide. The phylogenetic tree and most statistics were calculated using the PAML software package [33]. All analyses were performed on the region of overlap between the two reading frames, excluding the repetitive region. Synonymous and nonsynonymous rates were apportioned among the branches of the tree using the codeml program of the PAML package under the free ratio model [34].
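The last step of the alignment procedure, rebuilding a nucleotide alignment from a protein alignment, amounts to threading codons through the gapped protein sequence. The sketch below illustrates the idea in Python under the assumption that each coding sequence is exactly three times as long as its ungapped translation; the function name `thread_codons` is hypothetical and this is not the authors' implementation.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project the gaps of an aligned protein onto its coding sequence.

    Each residue consumes the next codon of `cds`; each gap character in
    the protein alignment becomes a three-character gap in the output.
    """
    n_residues = sum(ch != gap for ch in aligned_protein)
    if len(cds) != 3 * n_residues:
        raise ValueError("coding sequence must be 3x the number of residues")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


print(thread_codons("M-A", "ATGGCC"))  # ATG---GCC
```

Because gaps are always inserted in multiples of three, the reading frame of the threaded sequence is preserved across the alignment.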
The neighbor-dependent modification of the NG method was written in PERL programming language and is available from the authors upon request. The only difference from the classical NG algorithm [19] is that pathways creating stop codons in the alternative reading frame are ignored by our method. For example, let us consider the alignment in Table 5.
The alignment contains two reading frames: frame 0 starting at position 0 and frame 1 starting at position 1. The second codon of frame 0 contains two substitutions, and so there are two possible parsimonious pathways (pathways 1 and 2). Pathway 2 would convert the second codon of frame 1 into a stop (TAG), and so it is not considered by our method.
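The pathway filter can be illustrated with a short Python sketch. This is not the authors' PERL program, and the toy sequences below are invented for illustration; they are not the alignment in Table 5. The sketch assumes that flanking nucleotides keep the state of the first sequence while a codon is being changed, and that only intermediate states are checked, since the two endpoints are observed sequences.

```python
from itertools import permutations

STOPS = {"TAA", "TAG", "TGA"}


def stop_near(seq: str, codon_start: int, frame_start: int) -> bool:
    """True if a stop codon in the given frame overlaps seq[codon_start:codon_start+3]."""
    for i in range(frame_start, len(seq) - 2, 3):
        overlaps = i + 2 >= codon_start and i <= codon_start + 2
        if overlaps and seq[i:i + 3] in STOPS:
            return True
    return False


def allowed_pathways(seq_a: str, seq_b: str, codon_index: int, frame_starts=(0, 1)) -> list:
    """List the shortest pathways between two aligned frame-0 codons.

    A pathway is dropped when any intermediate state places a stop codon
    in one of the frames listed in `frame_starts`.
    """
    start = 3 * codon_index
    diffs = [p for p in range(start, start + 3) if seq_a[p] != seq_b[p]]
    allowed = []
    for order in permutations(diffs):
        current = list(seq_a)
        path = [seq_a[start:start + 3]]
        valid = True
        for step, pos in enumerate(order):
            current[pos] = seq_b[pos]
            state = "".join(current)
            path.append(state[start:start + 3])
            is_intermediate = step < len(order) - 1
            if is_intermediate and any(stop_near(state, start, f) for f in frame_starts):
                valid = False
                break
        if valid:
            allowed.append(path)
    return allowed


# Toy alignment: the second frame-0 codon changes GCA -> GTC.
print(allowed_pathways("GCCGCAGCC", "GCCGTCGCC", 1))
# [['GCA', 'GCC', 'GTC']]
```

In this toy case the pathway through GTA is discarded because it creates TAG in frame 1, whereas passing `frame_starts=(0,)` reproduces the classical behavior and keeps both pathways.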
To test whether the GC content of the XL-exon is required for the coexistence of the two reading frames, we first estimated codon frequencies in (1) human RefSeq genes, (2) the XLαs reading frame, and (3) the ALEX reading frame. This procedure was performed using a custom-designed PERL script. Coding regions of human RefSeq genes were downloaded from the National Center for Biotechnology Information ftp site (ftp://ftp.ncbi.nlm.nih.gov). We then used the evolver program of the PAML package to simulate 1,000,000 sequence sets with each of the three codon frequency tables. Each set contained eight sequences corresponding to the primate species used in this study. All other parameters accepted by evolver (phylogenetic tree, branch lengths, transition/transversion ratio, codon number, and the K_A/K_S ratio) were taken from the codeml output generated during nucleotide substitution analysis of our data and were fixed in all three simulations. Each set of simulated sequences was then inspected for the presence of +1 and −1 overlapping reading frames. A set of simulated sequences was considered to have an overlapping reading frame if such a frame was at least 1,000 bp long and was conserved in all eight sequences within the set.
Analysis of substitutions at CpG sites was carried out using a collection of PERL scripts, which can be obtained from the authors upon request.
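The inspection of simulated sets for a conserved overlapping frame can be sketched as follows. This Python illustration is not the authors' PERL script. It assumes ungapped sequences of equal length (as produced by a simulator), treats a frame as conserved when the same stretch of codons is stop-free in every sequence, and represents the +1 and −1 frames as same-strand offsets 1 and 2; the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_shared_open_stretch(seqs: list, frame_start: int) -> int:
    """Length in bp of the longest run of codons that is stop-free in all sequences."""
    length = min(len(s) for s in seqs)
    best = run = 0
    for i in range(frame_start, length - 2, 3):
        if any(s[i:i + 3] in STOPS for s in seqs):
            run = 0  # a stop in any sequence breaks the shared frame
        else:
            run += 3
            best = max(best, run)
    return best


def has_overlapping_frame(seqs: list, frame_start: int, min_len: int = 1000) -> bool:
    """True if the set contains a shared open frame of at least `min_len` bp."""
    return longest_shared_open_stretch(seqs, frame_start) >= min_len


# Toy set of two short sequences (not simulated data); frame offset 1 is the +1 frame.
toy = ["ATGGCCGCC", "ATGACCGCC"]
print(longest_shared_open_stretch(toy, 1))  # 3
```

Running `has_overlapping_frame` over every simulated set and counting the sets that return True gives the proportion of sets with an overlapping frame for a given codon frequency table.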

Acknowledgments
We thank Ross Hardison, Webb Miller, Davis Ng, and the members of the Center for Comparative Genomics and Bioinformatics for helpful insights and discussions. Genomic DNA for chimpanzee and macaque was obtained from the Coriell Institute for Medical Research. The study was supported by funds from the Pennsylvania State University, the Huck Institutes for Life Sciences, and the National Institutes of Health.
Competing interests. The authors have declared that no competing interests exist.