Evolutionary Convergence on Highly-Conserved 3′ Intron Structures in Intron-Poor Eukaryotes and Insights into the Ancestral Eukaryotic Genome

The presence of spliceosomal introns in eukaryotes raises a range of questions about genomic evolution. Along with the fundamental mysteries of introns' initial proliferation and persistence, the evolutionary forces acting on intron sequences remain largely mysterious. Intron number varies across species from a few introns per genome to several introns per gene, and the elements of intron sequences directly implicated in splicing vary from degenerate to strict consensus motifs. We report a 50-species comparative genomic study of intron sequences across most eukaryotic groups. We find two broad and striking patterns. First, we find that some highly intron-poor lineages have undergone evolutionary convergence to strong 3′ consensus intron structures. This finding holds for both branch point sequence and distance between the branch point and the 3′ splice site. Interestingly, this difference appears to exist within the genomes of green algae of the genus Ostreococcus, which exhibit highly constrained intron sequences through most of the intron-poor genome, but not in one much more intron-dense genomic region. Second, we find evidence that ancestral genomes contained highly variable branch point sequences, similar to more complex modern intron-rich eukaryotic lineages. In addition, ancestral structures are likely to have included polyT tails similar to those in metazoans and plants, which we found in a variety of protist lineages. Intriguingly, intron structure evolution appears to be quite different across lineages experiencing different types of genome reduction: whereas lineages with very few introns tend towards highly regular intronic sequences, lineages with very short introns tend towards highly degenerate sequences.
Together, these results attest to the complex nature of ancestral eukaryotic splicing, the qualitatively different evolutionary forces acting on intron structures across modern lineages, and the impressive evolutionary malleability of eukaryotic gene structures.


Introduction
Spliceosomal introns are genomically-encoded sequences that are removed from RNA transcripts by the spliceosome, a massive RNA-protein complex. The spliceosome and spliceosomal introns are common and ancestral to eukaryotes [1][2][3][4]; however, spliceosomal organization shows striking divergence across species. Intron number per genome differs by orders of magnitude, from fewer than ten known introns in the genomes of some protist species [4,5] to nearly ten per gene in some metazoans [6,7]. Introns also vary dramatically in length, from the 'bonsaied' 19 nt introns of the Bigelowiella natans nucleomorph [8] to the giant kilobases-long introns of humans and other mammals.
Intron sequence elements, which are important for intron recognition by the RNA and protein components of the spliceosome, also vary significantly across species. In some species, the consensus of an intron sequence is largely restricted to an initial 5′ GT dinucleotide (the "donor" site), a terminal 3′ AG (the "acceptor"), and a degenerate "branch point" site of a few nucleotides, lying somewhere within the intron. In other species, sequences are more conserved. For instance, in the apicomplexan parasite Cryptosporidium, 84% of introns begin with the most common sixmer GTAAGT, enabling close complementary base pairing with the U1 RNA of the spliceosome (compared with only 14% for humans; see examples in Figure 1). We previously studied the phylogenetic pattern of conservation of 5′ intronic sequences and found a strong correspondence between species with very few introns and those with such strong 5′ consensus sequences [9].
Clear differences are also observed in other structures. For instance, branch points vary across species from a highly conserved seven-mer (TACTAAC) in hemiascomycetous yeasts (e.g., Saccharomyces cerevisiae) to several lineages in which previous studies have failed to find branch points [10,11]. Here we report studies of the evolution of 3′ intron structures, including the branch point, poly-pyrimidine tract and 3′ splice site, across 50 species spanning all major eukaryotic kingdoms (opisthokonts, amoebozoans, red and green plants, chromalveolates, excavates and rhizarians; Figure 2).

Branch Point Consensus Strength in Intron-Poor Species
The branch point is an internal intronic sequence that initiates the splicing event through a nucleophilic attack by an adenosine 2′ hydroxyl group at the 5′ splice site [12,13]. As with donor sites, branch point consensus strength varies across species: 84% of Saccharomyces cerevisiae introns utilize the sequence TACTAAC and 94% have ACTAAC, compared to fewer than 20% of human introns with the exact same sixmer at the branch point site (Figure 1), and with Caenorhabditis nematodes, where no branch points have been identified.
We first studied intron branch points in available genomes from intron-poor species (Table 1; defined, as before, as those species with ~0.1 or fewer introns per gene on average [9]). Many of these species exhibited clear branch point consensus sequences. First, we found that the S. cerevisiae branch point consensus ACTAAC (throughout, the putative branch point A is underlined) is found in all fully-sequenced hemiascomycetous yeasts, with 66-100% of introns in each species containing this motif. Extended branch points of up to eight bases were found in some species (Table 1). We also found an extended single ACTAACC branch point motif in all 26 known introns of the red alga Cyanidioschyzon merolae. Other intron-poor species containing strong putative branch point motifs were the two excavate species: 32/34 (94.12%) probable introns (see Materials and Methods) in the flagellated protist Trichomonas vaginalis use WCTAAC, and all five known introns in Giardia lamblia (four published plus one unpublished instance) contain CTRACA. Finally, excluding two questionable predicted introns (see Discussion), all 13 introns in the microsporidian parasite Encephalitozoon cuniculi contain a TAAYTT hexamer (9/13 have CTAAYTT). Thus all lineages with strong branch points conformed to a general WCTRAYN consensus.
However, in at least one intron-poor lineage we failed to find such clear branch point sequences. Visual inspection and computational analysis (see below) of the intron-poor apicomplexan parasite Cryptosporidium parvum failed to reveal a potential branch point site. The case for the few introns of the Guillardia theta nucleomorph is less clear. 14/16 predicted G. theta introns contain a YAAY branch point-like sequence between 2 and 6 nucleotides from the acceptor AG, compared to only 2/16 that contain such a motif within the next 10 positions (7 to 17). Intriguingly, a second YAAY signal exists further upstream: 8/16 introns have a YAAY 24-28 nt from the 3′ terminus. The single known intron in each of the sequenced Trypanosomatid genomes does not show a clear branch point structure; however, it is of note that despite having few cis-spliced introns, these species do have large numbers of 3′ splicing boundaries due to the ubiquity of spliceosome-mediated spliced leader trans-splicing [14,15].

Author Summary
The spliceosomal introns that interrupt eukaryotic genes show great number and sequence variation across species, from the rare, highly uniform yeast introns to the ubiquitous and highly variable vertebrate intron sequences. The causes of these differences remain mysterious. We studied sequences of intron branch points and 3′ termini in 50 eukaryotic species. All intron-rich species exhibit variable 3′ sequences. However, intron-poor species range from variable sequences, to uniform branch point motifs, to uniform branch point motifs in uniform positions along the intronic sequence. This is a more complex pattern than the clear relationship between intron number and 5′ intron sequence uniformity found previously. The correspondence of sequence uniformity and intron number extends to species of the green algal genus Ostreococcus, in which the single intron-rich genomic region shows far more variable intron sequences than in the otherwise intron-poor genome. We suggest that different concentrations of spliceosomal complexes may explain these differences. In addition, we report the existence of 3′ polyT tails in diverse eukaryotic protists, suggesting that this structure is ancestral. Together, these results underscore the complexity of ancestral eukaryotic splicing, the qualitatively different evolutionary forces acting on intron sequences in modern eukaryotes, and the impressive evolutionary malleability of eukaryotic genes.

Branch Point Consensus Strength in Intron-Rich Species
We next studied conservation of branch points in 33 more intron-rich species. We studied the occurrence of motifs conforming to the consensus branch point WCTRAY or CTRAYN with varying levels of two-fold degeneracy (i.e., allowing two possible nucleotides at each site). For instance, the ACTAAC hexamer common to all C. merolae introns has no degeneracy, whereas the CTRACA motif of G. lamblia has one degenerate site, and the general consensus WCTRAY contains three two-fold degenerate sites. As with the other sites, either one or two nucleotides at a time were allowed at the final "N" site (CTRAYA, CTRAYC, CTRAYR, CTRAYY…).
For each species, and for each level of degeneracy (zero to three degenerate sites), we calculated the fraction of introns that contain the same motif within the last 200 nt of the intron (see Methods). Then, for each level of degeneracy, we identified the motif that was present in the largest fraction of introns for that species. These values are given in Table 2. As expected, the intron-poor species discussed above give high values at all levels of degeneracy. All intron-poor species with strong consensus discussed above have values of at least 72% for one-site degenerate branch point motifs.
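The motif search described above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the names `TEMPLATES`, `site_options`, `motifs_by_degeneracy` and `best_motif` are hypothetical, and the treatment of the final N site (any single nucleotide or any pair of nucleotides) is an assumption based on the examples given in the text. The details in the paper's Methods may differ.

```python
from itertools import combinations, product
import re

# Template sites as sets of allowed nucleotides
# (IUPAC codes: W = A/T, R = A/G, Y = C/T, N = any nucleotide).
TEMPLATES = {
    "WCTRAY": ["AT", "C", "T", "AG", "A", "CT"],
    "CTRAYN": ["C", "T", "AG", "A", "CT", "ACGT"],
}

def site_options(allowed):
    """Return (nucleotides, is_degenerate) choices for one template site.

    A site is either fixed to one allowed nucleotide (not degenerate)
    or left two-fold degenerate over a pair of allowed nucleotides.
    """
    singles = [(nt, 0) for nt in allowed]
    pairs = [("".join(p), 1) for p in combinations(allowed, 2)]
    return singles + pairs

def motifs_by_degeneracy(max_degenerate=3):
    """Map each degeneracy level to the set of candidate motif regexes."""
    out = {d: set() for d in range(max_degenerate + 1)}
    for sites in TEMPLATES.values():
        for combo in product(*(site_options(s) for s in sites)):
            level = sum(flag for _, flag in combo)
            if level <= max_degenerate:
                pattern = "".join(
                    nts if len(nts) == 1 else f"[{nts}]" for nts, _ in combo
                )
                out[level].add(pattern)
    return out

def best_motif(introns, level, window=200):
    """Return (pattern, fraction) for the motif present in the most introns.

    Only the last `window` nucleotides of each intron are searched.
    """
    tails = [seq.upper()[-window:] for seq in introns]
    best = (None, 0.0)
    for pattern in sorted(motifs_by_degeneracy()[level]):
        regex = re.compile(pattern)
        frac = sum(1 for t in tails if regex.search(t)) / len(tails)
        if frac > best[1]:
            best = (pattern, frac)
    return best

# Toy input (made-up sequences, for illustration only).
toy_introns = [
    "GTAAGTTTTACTAACTTTTAG",
    "GTATGTAAACTAACAAAG",
    "GTAAGTCCCTGACGGGCAG",
]
pattern, fraction = best_motif(toy_introns, level=0)
```

Calling `best_motif` once per degeneracy level (0 to 3) for each species would give the kind of per-species values that Table 2 reports, under the assumptions stated above.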
By contrast, every studied intron-rich species shows much lower scores, with fewer than 22% of introns sharing the same putative branch point motif (for example, 8.73% (ACTAAT) in Drosophila melanogaster or 10.97% (ACTGAC) in Aspergillus fumigatus), and fewer than 36% when one degenerate site is allowed. This relation between intron numbers and branch point consensus strength is underscored in Figure 3. The species are clearly distributed in two main groups (with two exceptions, the G. theta nucleomorph (NM) and C. parvum; red asterisks): intron-rich with weak branch point consensus, and intron-poor with strong branch point consensus (intron-poor: fewer than 0.15 introns per gene; strong branch points: the same BP-like hexamer in more than 50% of introns; red lines in Figure 3A). Overall, there is a negative correlation (r = −0.75) between intron number and branch point consensus conservation (by linear regression analysis, Figure 3B).

The Peculiar Case of Ostreococcus
Only the species Ostreococcus lucimarinus appeared to represent an intermediate between strong and weak branch point species, with 41.15% of predicted introns containing the same branch point-like ACTGAC sequence (Table 2). However, given the high evolutionary distance of O. lucimarinus from other species with fully-sequenced genomes and the relatively small number of available transcript sequences, annotation in O. lucimarinus and its distant congener O. tauri is difficult, and some intron predictions may thus be unreliable. To identify a confident set of introns, we performed BLASTN searches of the predicted intron-containing coding sequences against the 17,592 available O. lucimarinus EST sequences. Further computational and manual filtering for ambiguities and potential problems associated with reverse transcriptase [16,17] yielded a total of 560 confirmed intron sequences.
Figure 2. Phylogenetic relationship between the species included in this study. Consensus phylogenetic tree of the species included in this study. Red species names and discontinuous branches indicate intron-poor species. Note that representatives of all eukaryotic supergroups have been included in this study. Based on [47,48]. doi:10.1371/journal.pgen.1000148.g002
The confirmed introns show a stronger signal, with 50.7% containing the branch point sequence ACTGAC (and 37.9% showing an extended branch point GACTGACG). The pattern becomes even more striking when intron sequences are divided along the lines of the previously reported genomic heterogeneity of O. lucimarinus, with roughly half of chromosome 2 differing from the rest of the genome in a variety of ways including much higher intron density [18]. Confirmed chromosome 2 introns show very weak branch point signal, with only 4.6% sharing the same sixmer (CTGACG), while 87.2% of introns outside of chromosome 2 contain ACTGAC (Table 3), and 66.5% contain an extended GACTGACG motif. Confirmed introns outside of chromosome 2 also show very strong 5′ splice sites consistent with intron-poor structures (4.8 bits), whereas confirmed chromosome 2 introns show much weaker boundaries (1.9 bits) (Table 3). Notably, 5′ splice boundaries outside chromosome 2 exhibit the atypical consensus GTGCGTG, whereas chromosome 2 introns prefer a more typical GTRNGT.
In contrast to O. lucimarinus, predicted introns in O. tauri show far lower conservation of branch points (Table 2). The lack of EST sequences for this species makes confirmation difficult; however, a few putative intron sequences identified by TBLASTN searches of intron-containing non-chromosome 2 O. lucimarinus genes against the O. tauri genome exhibited sequences similar to O. lucimarinus introns, suggesting that O. tauri may exhibit a similar pattern. Given this seeming discrepancy between the full set of predicted O. tauri introns and these putatively confirmed ones, we excluded O. tauri from the analysis.
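For readers unfamiliar with splice site strength expressed in bits, the sketch below shows one common way to compute the information content of a set of aligned boundary sequences (the measure used in sequence logos). It is an illustration only: this section does not state which positions were summed or whether a small-sample correction was applied, so the function `information_content` and its conventions are assumptions rather than the procedure of reference [9].

```python
import math
from collections import Counter

def information_content(sequences):
    """Total information content, in bits, of equal-length aligned DNA sequences.

    Each position contributes 2 - H, where H is the Shannon entropy (in bits)
    of the nucleotide frequencies at that position. No small-sample
    correction is applied (an assumption of this sketch).
    """
    length = len(sequences[0])
    total = 0.0
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

A position where every intron carries the same nucleotide contributes 2 bits, and a position with all four nucleotides at equal frequency contributes 0 bits, so the total depends on how many positions are included in the alignment passed in.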

Acceptor Site Conservation
Next, we studied acceptor sequence conservation within genomes. The 3′ sequences of spliceosomal introns generally show more similarity across species, with most species showing a terminal YAG, sometimes preceded by a poly-pyrimidine tract (Figure 1), although most fungal species and some others lack the poly-pyrimidine tract [19]. A survey of 50 widely diverged eukaryotic species showed 5 clear exceptions to this pattern. Three species, the protists T. vaginalis and G. lamblia, and Yarrowia lipolytica, a hemiascomycetous fungus (a relative of the baker's yeast S. cerevisiae), show strikingly similar patterns (Figure 4). The three species lack the 3′ polypyrimidine tract and show clear anchoring of the branch point site at a specific number of base pairs from the 3′ terminal AG (the branch point-AG (BP-AG) distance: 2 nt (AC) in Y. lipolytica, 4 nt (ACAC) in T. vaginalis and G. lamblia; Figure 4).
Non-chromosome 2 O. lucimarinus introns show a preference for CGCAG, though notably weaker than that found in the other three species (Figure 4). Interestingly, O. lucimarinus introns' conserved 3′ boundaries are not associated with conserved BP-AG distance, as branch points for confirmed non-chromosome 2 O. lucimarinus introns show a broad peak ranging from around 20-35 bp (data not shown). It is interesting then that both O. lucimarinus and the constrained BP-AG distance species prefer a C at position -5 and an R at position -4. We were unable to find an explanation involving snRNA sequences for this preference. Finally, C. elegans introns also show stronger 3′ consensus, matching TTTCAG with nucleotides -6 and -5 significantly more conserved (Figure 4), as had been previously shown [20].

Evolution in Hemiascomycetous Yeasts
To better understand this pattern, we next studied available relatives of Y. lipolytica. We studied all six hemiascomycetes species with full genome annotations, as well as three additional Candida species for which some intron sequences were available. The species show pronounced differences, with three (including S. cerevisiae) showing large variations in BP-AG length and five species showing clear BP-AG constraint (differences in BP-AG length across hemiascomycetes were previously reported in [10]). In Debaryomyces hansenii, 65% of introns show a BP-AG distance between 6 and 8 nt, and 88% of introns have a BP-AG distance between 5 and 10 nt. 75% of introns in Eremothecium gossypii with well defined branch points have BP-AG between 6 and 9 nt, with 66% between 6 and 11. The small number of available introns in C. lusitaniae and C. guilliermondii suggested preferred BP-AG distances of 4-5 and 3 nt, respectively. This BP-AG constraint could partially reflect differences in intron lengths, as mean/median lengths are lower for some of these species across the clade (Figure 5). However, the species with the clearest pattern of constraint, Y. lipolytica, has rather long introns relative to the other species.
Intriguingly, in E. gossypii introns, sequences between the BP and intron terminus varied considerably based on the BP-AG distance. The 30 introns with a BP-AG distance of 6 nt (the shortest distance with more than a few introns) showed a strong sequence consensus at the 3′ end of the sequence, with 80.0% having a G at   introns with larger BP-AG distances. Y. lipolytica did not show differences in strength of 3′ sequence consensus for different BP-AG distances.
Mapping of BP-AG distances across hemiascomycetes shows a complex phylogenetic pattern (Figure 5). The five species with strong BP-AG constraint are intermingled on the tree with the three less constrained lineages, suggesting convergent evolution of BP-AG constraint. Importantly, the "preferred" BP-AG distance varies across species: for Y. lipolytica, the most common BP-AG distance is 2 nt, compared to 3 in C. guilliermondii, 4-5 for C. lusitaniae, 6-8 for E. gossypii and 7-8 for D. hansenii.
It seems unlikely that a species with a strong preference for a certain BP-AG distance would convert to a different distance, since this would require indels of a very specific length occurring in dozens of already constrained introns. It seems more likely that this condition reflects ancestrally relatively unconstrained BP-AG distance, and convergent evolution of constrained BP-AG distances in different lineages.

Introns in Intron-Rich Species Orthologous to Retained Introns in Intron-Poor Species
Convergent evolution of retained intron sequences in intron-poor species is likely due to a combination of two factors: preferential retention of introns with consensus-like (i.e., strong) sequences and change of retained intron sequences to consensus boundaries. However, the relative impact of these two factors is unknown. We attempted to address this issue by identifying introns in intron-rich species that were present at the exact homologous position to introns in any available intron-poor species, and thus are likely to be ancestral to both species. If strong consensus sequences in intron-poor species are due to differential retention of introns with conserved ancestral sequences, it is possible that orthologous introns in intron-rich species could retain some of this signal. For each intron-rich species (the apicomplexan Toxoplasma gondii, H. sapiens, S. pombe, A. fumigatus, and A. thaliana), we compared 5′ splice site strength and branch point conservation between introns putatively orthologous to introns retained in at least one intron-poor species and the total set of introns in these species. Significant differences for both intron structures were found for A. fumigatus and T. gondii introns (Table 4).
Although this analysis is perhaps the most direct available test of the hypothesis of preferential intron retention, it is deeply undermined by the great phylogenetic distances between intron-poor species and even their closest relatives (T. gondii-C. parvum and A. fumigatus-hemiascomycetes both diverged many hundreds of millions of years ago) and the associated large amounts of sequence change. Therefore, the finding of a positive signal for any of these comparisons is surprising and intriguing. To further test whether the observed stronger intron consensus sequence signal in intron-poor organisms could truly reflect greater boundary strengths retained from the ancestor, we further divided the orthologous set into those introns shared with a more closely related intron-poor species (Y. lipolytica for A. fumigatus; C. parvum for T. gondii) and those shared only with distantly related species. Whereas the former could conceivably retain some specific ancestral signal due to lack of change, the second set represents divergences dating back upwards of a billion years to early eukaryotic evolution, seemingly precluding the possibility that similarities in individual intron boundary strengths reflect a lack of sequence change since that time. Unexpectedly, we observed that in T. gondii this second subset of introns, shared only with more distant relatives, shows stronger signal than the introns shared with the closest intron-poor relative (Table 4). This suggests that boundaries with greater strength in intron-poor species do not reflect retained ancestral signal.
(There was an insufficient number of introns in the A. fumigatus distantly-related group for comparison.)
A second observation also suggests that these intron subsets' stronger boundaries reflect something other than retention of ancestral boundary strength: A. fumigatus and Y. lipolytica exhibit different 5′ splice site consensus sequences. Thus, while A. fumigatus introns with homology to retained Y. lipolytica introns do show greater homogeneity in 5′ splice site boundaries (matching the consensus GTAAGT), they do not more closely resemble Y. lipolytica boundaries (consensus GTGAGT). Indeed, we observe the opposite trend: only 15.2% (26/171) of the A. fumigatus introns shared with Y. lipolytica have a G in position +3, whereas in the whole set of A. fumigatus introns, 21.4% have a G in that position.

PolyT Distributions
Finally, we studied the distribution of characteristic intronic polyT motifs along intron length. For each species, we calculated frequencies of intronic minimal "polyT motifs" (following previous studies, we define these as six consecutive nucleotides containing at least 3 T's and no A's [19,21,22]) as a function of distance from the acceptor site.
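The polyT motif definition above translates directly into a sliding-window scan. The following is a minimal sketch: the function names `is_polyt` and `polyt_profile` are hypothetical, and measuring distance as the number of nucleotides between the 3′ end of the six-nucleotide window and the 3′ end of the intron is an assumption, since the exact convention is not given in this section.

```python
def is_polyt(window):
    """True if a six-nucleotide window has at least 3 T's and no A's."""
    return len(window) == 6 and "A" not in window and window.count("T") >= 3

def polyt_profile(introns, max_distance=50):
    """Fraction of introns with a polyT motif at each distance from the 3' end.

    Index d of the returned list refers to the window whose last nucleotide
    lies d nt upstream of the final intron nucleotide (assumed convention).
    Introns too short to provide a window at distance d are not counted there.
    """
    hits = [0] * max_distance
    totals = [0] * max_distance
    for seq in introns:
        seq = seq.upper()
        for d in range(max_distance):
            end = len(seq) - d
            start = end - 6
            if start < 0:
                break
            totals[d] += 1
            if is_polyt(seq[start:end]):
                hits[d] += 1
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

# Toy input (a made-up intron, for illustration only).
toy_profile = polyt_profile(["GTAAGTCTTTTCCAG"], max_distance=10)
```

Plotting such a profile for each species would give curves of the kind compared across lineages in this section.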
Almost all species conformed to one of 3 broad patterns, which tend to be conserved within large phylogenetic groups (Figure 6). For the most common distribution (found in metazoans, plants, most apicomplexa and the heterokont Phytophthora species), polyT motifs concentrate near the intron terminus (Figure 6A). The 5′ limit of the distribution is likely determined by branch point position in some species (~30 nt, similar to the mean BP-AG distance of 31.5 nt in mammals [23] and 27.6 nt in Arabidopsis thaliana [24]). In other species (Caenorhabditis elegans, Ciona intestinalis, Drosophila melanogaster, and the apicomplexan Theileria parva) polyTs are concentrated in the last ~10 nt, and are underrepresented 10-15 nt from the terminus. This pattern could suggest a more 3′ branch point position, although branch points in these species are difficult to determine [25]. The rhizarian Plasmodiophora brassicae seems consistent with this broad pattern; however, the small number of available introns renders confident conclusions difficult.
Second, in most fungi, polyTs are roughly equally common across the intron (with the exception of the position of the branch point site, which typically falls ~13-25 nt from the terminus [19]) (Figure 6B). This pattern resembles that found in some intron-poor species, including the T-rich introns of C. parvum as well as the more moderately T-rich introns of hemiascomycete yeasts (Figure 6C). S. cerevisiae shows a partial exception to the pattern, with a pronounced peak ~10 nt from the terminus.
The third pattern is found in the two fully-sequenced amoebozoans (Dictyostelium discoideum and Entamoeba histolytica) and in the intron-poor fungus Ustilago maydis (Figure 6D). This pattern shows a single peak in polyT occurrence, centered between 15 and 40 nt from the acceptor site.
Although the numerous exceptions make firm conclusions difficult, the broad phylogenetic distribution of the first pattern (in animals, plants, apicomplexans and heterokonts, and perhaps rhizarians), suggests that the ancestral intronic structure had polyT motifs concentrated between the BP and the terminal AG, and that broader polyT distributions evolved early in fungal evolution.

Evolutionary Convergence to Highly-Conserved Intron-Exon Structures in Distant Eukaryotes
We report convergent evolution of strong branch point consensus sequences and constrained branch point positions in eukaryotic lineages ranging from fungi to plants to protists. These observations join our previous findings of convergent evolution to strong 5′ splice site boundaries [9], as well as the pattern of recurrent nearly-complete intron loss, as examples of convergent intron-exon structure evolution across eukaryotes [26]. Interestingly, these four patterns appear to be closely related. Those lineages that are highly derived in intron number, having lost most of their introns, are the same ones that exhibit constraint of their few remaining introns' sequences.
However, different intron sequence characteristics show different degrees of co-evolution with intron number. Whereas strong splice sites show a one-to-one correspondence with low intron number across species [9], only a subset of intron-poor lineages have strong branch point sequences, of which only a subset have highly constrained branch point positions. Thus while intron paucity may be necessary for the emergence of branch point sequence and position constraint, it is not sufficient.
Differences in levels of intron structure constraint are likely associated with (perhaps even driven by; see below) changes in the spliceosomal machinery that have led to increased requirements for adherence to consensus sequences [9,27]. Indeed, the intron-poor species whose splicing apparatus has been most extensively studied, S. cerevisiae, shows considerable alterations in the mechanisms and protein components of its spliceosome [2,19]. Future work should explore spliceosomal changes in other intron-poor lineages, in particular the possibility of evolutionary convergence in spliceosomal machinery across lineages.
Notably, we failed to find intermediate stages. 5′ splice site strength shows a clear gap between intron-poor lineages (at least 5 bits of information content) and intron-rich lineages (1-4 bits) [9]. Almost all species have either clear and strong branch point consensus (66-100% of introns with the same branch point hexamer) or much weaker conservation (<30%). Branch point position also seems bimodal, as clearly seen among the hemiascomycetous yeasts, where either >80% of branch points fall within a few base pairs, or fewer than 40%.
For 5′ splice sites, this lack of intermediates is consistent with some qualitative difference in the selective regimes acting within intron-poor and intron-rich species, leading to a lack of intermediate strengths. For branch point sequences and positions the case is more subtle. Do weak branch points in some intron-poor lineages reflect an ongoing process, or are these lineages somehow refractory to 3′ intron convergence? Repeated evolution of constraint in hemiascomycetes may suggest an ongoing process; however, in this case we might expect to observe intermediate stages. Possibly, once put in motion, intron structure constraint proceeds rapidly, which could explain the lack of observed intermediates.

Opposed Intron Structure Evolution in Two Classes of Reduced Eukaryotes
Widespread sequencing has underscored the complexity of eukaryotic genome structure. While some genomes seem generally complex (with large numbers of genes containing numerous long introns and ubiquitous transposable elements) or simple (with short intergenic regions flanking a modest complement of nearly intronless genes), many genomes defy such straightforward characterization. Intron-exon structures provide a clear example: the three genomes with the shortest known intron structures each have relatively high intron densities (Paramecium tetraurelia, the nucleomorph of B. natans, and Dicyemids, so-called mesozoans), whereas introns in very intron-poor species are not particularly short (Table 2).
Interestingly, these two classes of reduced lineages appear to show opposed patterns of intron evolution. Whereas intron-poor lineages tend towards highly-constrained intron sequence elements, short-intron lineages seem to show very weak sequence constraint. Available Dicyemid introns give the weakest known score for 5′ intron boundaries (0.5 bits), and 5′ splice sites of P. tetraurelia and B. natans are largely restricted to GT(A). These three lineages also show no signature of branch points (Table 2). This does not simply reflect an inability of short introns to accommodate branch points or reduced splicing constraints associated with short introns per se: E. cuniculi introns (35.8 nt on average) and many T. vaginalis introns (~25 nt) are short, yet both show conserved 5′ splice site and branch point sequences (note that this also suggests that species with both types of genome reduction exhibit strong consensus, reflecting their intron paucity). This pattern underscores the importance of intron number, and not simply genome reduction, in driving the emergence of strong consensus sequences.

Hypotheses for Intron Convergence and a Natural Experiment in Ostreococcus
The finding of a general inverse correspondence between intron number and splicing signals' strength is unexpected and remains unexplained. Previously, we suggested that in intron-poor species, selection against aberrant splicing of cryptic splice sites would drive changes in the spliceosome towards stricter splicing requirements, which would in turn drive sequence change in (or loss of) non-conforming introns [9]. In intron-rich species, this evolutionary pathway would not be available since increased spliceosomal strictness would imply deleterious inefficient splicing of a much larger number of non-consensus introns [9].
The genome of the ultra-small green alga Ostreococcus lucimarinus provides a rare natural experiment to test this hypothesis. While genes in the majority of the genome exhibit very low intron densities, the genes spanning roughly half of chromosome 2 show a much higher density, well within that of ''intron-rich'' species [18]. That the two sets diverge so clearly in level of intron sequence constraint is clearly not predicted by general alterations of splicing strictness due to changes in a (assumed) single spliceosome.
One possibility is that the two intron sets are serviced by different spliceosomes (as is the case of U2 and U12 in different eukaryotic lineages). However, a computational search turned up only single copies of spliceosomal RNA components in congenitor O. tauri [28]. Conceivably snRNA changes to complement the divergent Ostreococcus 59 splice sites and branch points (GTGCGTG and GACTGACG in O. lucimarinus) could have thwarted their identification in the previous study, and this possibility is worth exploring. Alternatively (though more difficult to test), a single core RNA splicing machinery could associate with different sets of protein components in distinct spliceosomes with different splicing activities.
More likely, however, O. lucimarinus contains a single spliceosome, strongly suggesting that the constrained introns through most of the genome, as well as those of intron-poor species that they so closely resemble, do not reflect inherent changes in the spliceosomal machinery. A simpler possibility is that differences in local (in O. lucimarinus) or cellular (in other species) concentrations of spliceosomal complexes is the driving factor. It seems possible or even likely that spliceosomal complexes in intron-poor species are downregulated. Such downregulation could reflect either selection to reduce incorrect splicing of truly exonic sequence (i.e. fewer spliceosomes, less chance of false splice boundaries being spliced), or could be favored by reducing energetic costs associated with transcription, processing, and translation of spliceosomal components. If so, the lowered concentration of spliceosomal components would require stronger binding affinity of individual splice sites to corresponding snRNAs for efficient splicing, which would in turn drive the evolution of stronger boundaries (or intron loss). Differential local concentrations across genomic regions in Ostreococcus could be maintained if spliceosomes were preferentially recruited to the intron-rich genomic region. This scenario is similar to our previous hypothesis in invoking a tradeoff between the costs of efficient splicing of weak boundaries (maintenance of high spliceosome concentration) and the costs of mis-splicing, which we argue would likely be different in intron-rich and -poor species. This hypothesis makes the testable prediction that spliceosomal components should show reduced expression in intron-poor species relative to intron-rich species.
Another hypothesis concerning the concentration of spliceosomal complexes, suggested to us by Tony Russell, sees a very different role for selection. A predicted consequence of increasing the length of snRNA-intron element base-pairing interactions is a reduction in the overall rate of splicing, simply due to a tighter association between intron and spliceosome. Such a decrease in splicing rate will be tolerated in intron-poor species if spliceosomal components are in excess relative to the number of introns. However, in intron-rich species spliceosomal components may not be in excess, in which case stronger base-pairing between intron and snRNAs could be disfavored. Notably, these two hypotheses make qualitatively different predictions. Whereas the latter hypothesis predicts that strong boundaries would be disfavored in intron-rich species, the former predicts that they would be as fit as or fitter than weaker boundaries. Comparative analysis of closely related species to test these predictions is underway.

Sequence Convergence versus Preferential Loss
Two factors could drive evolutionary convergence to strong intron boundaries: sequence changes in existing intron sequences to consensus sequences, and preferential loss of non-consensus introns. The relative contributions of these factors may depend on the precise evolutionary pathway from (ancestral) genomes with many introns with weak boundaries and relatively lax splicing requirements to fewer introns with strong boundaries and stricter requirements.
First, widespread (mostly random) intron loss could lead to selective conditions favoring the evolution of a spliceosome with stricter sequence requirements for splicing (as argued above and in reference [9]). Introns with non-consensus sequences would then impose a burden, which could be resolved by sequence change or intron loss. If intron loss rates in these lineages are at least comparable to substitution mutation rates (for instance, 90% intron loss over 500 million years is consistent with a constant loss rate of 5 × 10⁻⁹ per year, comparable to some estimated mutation rates [29,30]), preferential loss could play an (or even the) important role in convergence. In this case, intron loss would be a self-catalyzing process, with widespread intron loss leading to increased splicing requirements driving yet faster intron loss.
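The loss-rate figure quoted above can be checked with a short calculation. The sketch below assumes that "constant loss rate" means a constant per-intron, per-year rate (exponential decay of intron number); that interpretation and the function name are ours, not stated in the text.

```python
import math

def constant_loss_rate(fraction_lost, years):
    """Per-intron, per-year rate r such that exp(-r * years) equals the
    fraction of introns retained. Assumes exponential decay (our assumption)."""
    retained = 1.0 - fraction_lost
    return -math.log(retained) / years

# 90% of introns lost over 500 million years
rate = constant_loss_rate(0.90, 500e6)
print(f"{rate:.1e} per intron per year")
```

Under this assumption the rate comes out at roughly 4.6 × 10⁻⁹ per year, which rounds to the value given in the text.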
Alternatively, increased splicing requirements could come first, driving intron loss (consistent with [31]). However, evolution of stricter splicing requirements in intron-rich organisms, where there are large numbers of non-consensus introns, would lead to widespread deleterious mis-splicing. Thus it is hard to imagine such strict requirements arising prior to widespread intron loss.
Finally, even under lax splicing requirements, loss of non-consensus introns could be more strongly selected due to the less efficient splicing of these introns (e.g. [32]), with stricter splicing requirements gradually enabled by the loss of non-consensus introns. The viability of this scenario depends on introns with suboptimal boundaries carrying significantly higher selective costs than optimal introns in intron-rich species. Attempts to estimate these selective costs by comparative analysis are underway.
In any case, it seems likely that intron loss, increased splicing constraints, and intron sequence change will all reinforce one another, such that (perhaps past some critical threshold) the three will proceed in tandem. Differences in the relative contributions of the three phenomena will depend on mutation rates and selective coefficients for different kinds of changes (base pair substitutions, intron loss, spliceosomal changes), which may vary considerably across times and lineages.

Convergent Evolution between U12-Type Introns and Nearly Intronless Species' U2-Type Introns
The convergent U2 intron structures observed here and elsewhere in some intron-poor species (strong 5′ and branch point consensus sequences, constrained BP-AG length) are strikingly reminiscent of the U12-type intron structures found across a wide variety of lineages [33,34]. One possibility is that the U12-type intron structures represent a derived state (possibly evolved in the ancestor of eukaryotes), and that these convergent cases have similar causes: that U12-type introns' low genomic number has subjected them to the same pressures as those experienced by U2-type introns in intron-poor lineages. Notably, if the conservation across lineages of specific U12 intron sequence elements (as opposed to the differences in consensus structures observed in U2-type introns across diverse intron-poor lineages) reflects the emergence of these structures early in eukaryotic history, this interpretation would imply that U12-type introns have been rare since very early in eukaryotic history. Alternatively, the similar structures of U12 introns in general and U2 introns in intron-poor species could have different explanations. One possibility is that the U12 system represents an intermediate between the type II introns that initially proliferated in early eukaryotic ancestors, with their highly similar sequences and structures, and typical highly degenerate U2 introns. In this case the persistence of strong consensus sequences in U12-type but not U2-type introns remains somewhat mysterious, as does the preponderance of U2 introns relative to U12.

Evidence that polyT Tails and Weak Branch Point Sites Are Ancestral
We find different patterns of polyT motif distribution along the introns in different lineages, which likely reflect differences in polyT functionality and in how polyT-binding factors regulate splicing in these lineages. Indeed, the differential conservation and evolution of spliceosomal proteins binding polyT tracts (PTB, SXL, TIA1, Nam8, etc.) in each species is likely to determine the position of polyT motifs along introns and what role these motifs play in splicing regulation.
Among these locations, we find that the so-called "polyT tail", spanning from a position likely corresponding to the branch point to the 3′ intron terminus, is common to a wide variety of groups ranging from plants to animals to various widely diverged protists, strongly suggesting that the existence of a polyT tail may be ancestral. Furthermore, we show that, as in the case of 5′ splice sites, strong branch point consensus sequences have evolved independently only in intron-poor species, whereas all intron-rich species have weak branch point sites. Since a wide variety of studies have shown that eukaryotic ancestors harbored relatively high intron numbers [3,35-38], our results suggest that the eukaryotic ancestors also had weak branch point consensus sequences, as do most modern eukaryotic groups.
These conclusions extend an excellent recent study from Schwartz et al. [27], who studied BP and polyY motifs in 19 opisthokonts and 3 other eukaryotic species, and reached similar conclusions about ancestral intron structures. Our inclusion of a more diverse set of species spanning the known diversity of eukaryotes allows us to reach deeper into eukaryotic evolution, potentially getting much closer to the initial origin of spliceosomal introns. In particular, we find that sequences from the first characterized rhizarian, as well as from various heterokonts, follow the patterns found in other kingdoms. This striking similarity over very broad evolutionary distances significantly strengthens our conclusions about ancestral eukaryotic splicing, rendering them independent of the placement of the root of the eukaryotic phylogeny.
The presence of these intronic features (a polyT tail and weak branch point sites) in the eukaryotic ancestors adds to the developing picture of the spliceosomal system in early eukaryotes: a highly developed spliceosome, weak 5′ and branch point sequences, a polyT tail, and complex splicing patterns [2,9,39,40].

Quality of Predictions
One limitation of the data deserves comment. We analyze annotated intron sequences, making our analysis subject to the quality of available annotations. Problems with the annotations may very directly influence the characteristics studied here, since for instance more consensus-like sequences are more likely to be identified as introns. Such concerns may also affect comparative analysis if the annotation efforts for different species are differentially sensitive to different kinds of introns.
However, such limitations are very unlikely to drive the qualitative differences we see here. For a "weak" species (with, for instance, 30% of its true introns exhibiting the same branch point) to be incorrectly identified as a "strong" species (with, say, 80% of predicted introns with the same branch point) would require that the vast majority of its non-consensus introns have gone unannotated. For the reverse to occur, there would have to be so many falsely predicted introns that they drowned out the signal almost entirely. Thus, while it is important to point out that our results are likely not accurate to the second decimal place due to problems with annotations, such problems are very unlikely to be driving the large qualitative differences observed.
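The size of the annotation bias needed in the 30% versus 80% example can be made explicit. The sketch below assumes that every consensus intron is annotated and that a fraction f of non-consensus introns is annotated; this simple model and the function name are illustrative assumptions, not part of the original analysis.

```python
def annotated_fraction_needed(true_consensus, observed_consensus):
    """Fraction f of non-consensus introns that must be annotated so that
    c / (c + (1 - c) * f) equals the observed consensus fraction, where c is
    the true consensus fraction. Assumes all consensus introns are annotated."""
    c, o = true_consensus, observed_consensus
    return c * (1.0 - o) / (o * (1.0 - c))

f = annotated_fraction_needed(0.30, 0.80)
print(f"annotated: {f:.1%}, missed: {1 - f:.1%}")
```

Under these assumptions only about one in nine non-consensus introns could have been annotated, meaning close to 90% of them would have to be missing from the annotation.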

Strong 5′ Splice Sites in E. cuniculi
Possibly the most important impact of annotation errors is to reduce the signal in very intron-poor species. For instance, further scrutiny suggested that 2 of the 15 predicted introns in E. cuniculi may not in fact be introns at all: both have lengths that are a multiple of 3 base pairs and lack in-frame stop codons; these two introns have the weakest 5′ boundaries (matching the consensus (GT)AAGT at 1 and 2 out of 4 positions, compared to at least 3 matches for the other 13 introns); and one intron has similar sequences at the two boundaries, suggesting that this intron prediction could reflect a reverse transcriptase artifact in EST preparation [17]. Notably, in our previous work on donor splice sites [9], E. cuniculi represented the only intron-poor species lacking very clear strong boundaries. Excluding these two questionable introns, E. cuniculi has a donor site information content of 6.2 bits, comparable to the other intron-poor lineages.

Conclusion
These results attest to the plasticity of spliceosomal intron structures through the history of eukaryotes. The availability of large numbers of eukaryotic genomes now allows comparative analysis of an increasing diversity of genomic structures. Present and previous works have provided an increasingly detailed picture of the patterns and determinants of intron-exon structures, one of the hallmarks of eukaryotic genome organization. Definitive identification of the causes of highly regular 3′ intron structures awaits the identification of additional lineages exhibiting this pattern.

Study of Branch Point Consensus Strength
Studying the clear branch point consensus from a wide variety of intron-poor species, we first define an extended branch point consensus, WCTRAYN, consistent with the minimal consensus NYYNAN described for a wide variety of eukaryotic groups [19,23,24,46].
For each of the 50 species, we next studied the percentage of introns showing the most common hexamer matching this extended consensus, allowing two-fold degeneracy at zero, one, two and three sites of the putative branch point hexamer. The four measures are complementary.
We did not aim to identify and study branch points in all introns. Instead, the percentage of introns using the most common motifs gives a straightforward measure of the strength of the signal for a given species. Applying a similar approach to the 5′ splice site (whose definition is trivial) shows a clear correspondence between strength measured as the percentage of introns with the most common sequence motif (i.e. GTAAGT, GTATGT, etc.) and strength measured as information content, which is used broadly in the literature: the correlation coefficient between the two variables for the species included in this study is r = 0.96.
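As an illustration of the branch point strength measure, the following sketch computes the percentage of introns containing the most common concrete hexamer that matches a degenerate pattern. It covers only the zero-degeneracy case, and several details are our assumptions rather than the procedure actually used: the hexamer pattern WCTRAY (the extended consensus without its final N), scanning the whole intron, and counting each intron at most once per motif.

```python
from collections import Counter

# IUPAC codes used by the consensus (only the subset needed here)
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "AT", "R": "AG", "Y": "CT", "N": "ACGT"}

def matches(kmer, pattern):
    """True if every base of kmer is allowed by the IUPAC pattern."""
    return len(kmer) == len(pattern) and all(
        base in IUPAC[code] for base, code in zip(kmer, pattern))

def top_motif_usage(introns, pattern="WCTRAY"):
    """Return (motif, percent of introns containing it) for the most common
    concrete hexamer matching the pattern; (None, 0.0) if nothing matches."""
    k = len(pattern)
    counts = Counter()
    for seq in introns:
        seq = seq.upper()
        # A set, so that each intron contributes at most once per motif
        found = {seq[i:i + k] for i in range(len(seq) - k + 1)
                 if matches(seq[i:i + k], pattern)}
        counts.update(found)
    if not counts:
        return None, 0.0
    motif, n = counts.most_common(1)[0]
    return motif, 100.0 * n / len(introns)

example = ["GTAAGTTTACTAACTTTAG", "GTATGTACTAACAAATAG",
           "GTAAGTTCTGACTTTCAG", "GTAAGTGGGGGGGGGCAG"]
motif, percent = top_motif_usage(example)
```

The degenerate variants (two-fold degeneracy at one, two or three sites) would require grouping concrete hexamers under partially degenerate motifs, which is not attempted here.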

Study of 5′ and 3′ Splice Site Consensus and BP-AG Distance
We aligned the final 20 nt of each intron for each species using WebLogo (http://weblogo.berkeley.edu/logo.cgi). To better characterize the evolution of BP-AG constraint in Y. lipolytica, we further studied BP-AG distance in the 9 hemiascomycete species. In all hemiascomycete species, the vast majority of introns contained a single TACTAAC sequence, which was used as the branch point. The BP-AG distance was defined as the number of nucleotides (Nn) in 5′-TACTAAC|Nn|AG-exon-3′.
For 5′ splice sites we used a methodology similar to that described in [9]. The first 6 bases of each intron were extracted, and information content in bits for positions +3 to +6 was calculated using the online PICTOGRAM software (http://genes.mit.edu/pictogram.html).
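The BP-AG distance definition can be sketched as a small function. The handling of introns with zero or multiple TACTAAC copies (returned as None) is our assumption, since the text only states that most introns contain a single copy; the function name is illustrative.

```python
def bp_ag_distance(intron, branch_point="TACTAAC"):
    """Number of nucleotides between the end of the branch point motif and
    the terminal AG of the intron. Returns None if the intron does not end
    in AG or does not contain exactly one copy of the motif (our assumption)."""
    seq = intron.upper()
    if not seq.endswith("AG") or seq.count(branch_point) != 1:
        return None
    bp_end = seq.index(branch_point) + len(branch_point)
    return len(seq) - 2 - bp_end

# 5'-GTATGTAAA TACTAAC | TTTTT | AG-3'
distance = bp_ag_distance("GTATGTAAATACTAACTTTTTAG")
```

In the example the five T residues between TACTAAC and the terminal AG give a distance of 5.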
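For readers who wish to reproduce an information content value of this kind without the web tool, a minimal sketch follows. It assumes a uniform background (a maximum of 2 bits per position) and applies no small-sample correction; PICTOGRAM may differ on both points, so the values need not match exactly.

```python
import math
from collections import Counter

def information_content(seqs, start=3, end=6):
    """Summed information content in bits over 1-based intron positions
    start..end, computed as 2 - H per position, where H is the Shannon
    entropy of the observed base frequencies (uniform background assumed)."""
    total = 0.0
    for pos in range(start - 1, end):
        counts = Counter(s[pos].upper() for s in seqs)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total

strong = information_content(["GTAAGT", "GTAAGT", "GTAAGT"])
mixed = information_content(["GTAAGT", "GTATGT"])
```

With four positions, the maximum under these assumptions is 8 bits, reached when every intron carries the same bases at positions +3 to +6.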

Study of Ostreococcus Introns
We downloaded available EST sequences for O. lucimarinus from NCBI on April 4th, and performed standalone BLASTN searches for each predicted intron-containing O. lucimarinus gene against the ESTs. Preliminary confirmed introns were identified as those in which an EST hit with >60 bits and with >90% sequence identity spanned the intron position (reaching at least 3 nt on each side of the intron position). Each of these introns was then analyzed by eye to exclude non-canonical intron positions as well as those for which sequence similarity between the regions spanning the 5′ and 3′ predicted splice boundaries was consistent with a template-switching artifact during the reverse transcription step of the EST library preparation [16,17].
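The hit-filtering criteria can be expressed as a simple predicate. The sketch below is hypothetical in its details: it assumes that the query is the predicted spliced transcript, that the intron position is given as the 1-based query coordinate of the last exonic base before the junction, and that each hit carries the percent identity, bit score and query coordinates found in BLAST tabular output. The class and function names are ours.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    pident: float    # percent identity of the alignment
    bitscore: float  # bit score of the alignment
    qstart: int      # 1-based alignment start on the query
    qend: int        # 1-based alignment end on the query

def confirms_junction(hit, junction, flank=3, min_bits=60.0, min_identity=90.0):
    """True if the hit exceeds both thresholds and covers at least `flank`
    nucleotides on each side of the exon-exon junction. `junction` is the
    position of the last base of the upstream exon (assumed convention)."""
    lo, hi = sorted((hit.qstart, hit.qend))
    strong = hit.bitscore > min_bits and hit.pident > min_identity
    spans = lo <= junction - flank + 1 and hi >= junction + flank
    return strong and spans

ok = confirms_junction(Hit(95.0, 120.0, 100, 300), junction=200)
```

Hits passing this filter would still need the manual inspection described above, since template-switching artifacts cannot be detected from scores alone.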

Orthologous Intron Analysis
For each species considered, databases of intron/exon structures of predicted gene transcripts were prepared from the genome annotations. Homologs were identified by one-way BLASTP searches of intron-containing genes from intron-poor species against predicted proteomes from intron-rich species. Putatively orthologous introns were identified as those present at identical alignment positions (position and phase) in both species. Related species were defined as: C. parvum for T. gondii; Y. lipolytica (chosen as the representative hemiascomycete because its intron density is 3 times higher than that of S. cerevisiae, which increases sample size) for S. pombe and A. fumigatus; and C. merolae and the G. theta nucleomorph for A. thaliana. Due to the lack of related intron-poor species, H. sapiens introns were divided into only two groups.

Search for polyT Motif Distributions
We used the minimal definition for polyT motif, defined as six consecutive nucleotides containing at least 3 T's and no A's [19,21,22]. The study of introns for each species was performed using custom Perl scripts. The last 2 and the first 10 base pairs of each intron were excluded.
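The original analysis used custom Perl scripts that are not reproduced here. As an illustration only, the motif definition above can be sketched in Python as a sliding-window scan; the function name and the choice to report window start positions are our assumptions.

```python
def polyt_windows(intron, window=6, min_t=3, skip_start=10, skip_end=2):
    """Return 0-based start positions (in intron coordinates) of every
    window of `window` nucleotides with at least `min_t` T's and no A's,
    ignoring the first `skip_start` and last `skip_end` nucleotides."""
    seq = intron.upper()
    core = seq[skip_start:len(seq) - skip_end]
    hits = []
    for i in range(len(core) - window + 1):
        w = core[i:i + window]
        if w.count("T") >= min_t and "A" not in w:
            hits.append(skip_start + i)
    return hits

positions = polyt_windows("GTAAGTAAAAGCTTTCTCAAAG")
```

Overlapping windows are reported separately in this sketch; a distribution along the intron could then be built by normalizing positions relative to either intron end.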