A Detailed History of Intron-rich Eukaryotic Ancestors Inferred from a Global Survey of 100 Complete Genomes

Protein-coding genes in eukaryotes are interrupted by introns, but intron densities widely differ between eukaryotic lineages. Vertebrates, some invertebrates and green plants have intron-rich genes, with 6–7 introns per kilobase of coding sequence, whereas most of the other eukaryotes have intron-poor genes. We reconstructed the history of intron gain and loss using a probabilistic Markov model (Markov Chain Monte Carlo, MCMC) on 245 orthologous genes from 99 genomes representing the three of the five supergroups of eukaryotes for which multiple genome sequences are available. Intron-rich ancestors are confidently reconstructed for each major group, with 53 to 74% of the human intron density inferred with 95% confidence for the Last Eukaryotic Common Ancestor (LECA). The results of the MCMC reconstruction are compared with the reconstructions obtained using Maximum Likelihood (ML) and Dollo parsimony methods. An excellent agreement between the MCMC and ML inferences is demonstrated whereas Dollo parsimony introduces a noticeable bias in the estimations, typically yielding lower ancestral intron densities than MCMC and ML. Evolution of eukaryotic genes was dominated by intron loss, with substantial gain only at the bases of several major branches including plants and animals. The highest intron density, 120 to 130% of the human value, is inferred for the last common ancestor of animals. The reconstruction shows that the entire line of descent from LECA to mammals was intron-rich, a state conducive to the evolution of alternative splicing.

1 Supporting Methods

Homolog identification
Orthologs were identified by slightly modifying our previously described procedure  We used the eukaryotic ortholog groups (so-called euKOGs and euNOGs) from the eggNog database (Muller et al., 2010). The ortholog groups were not used directly, but only for "seeding" a homolog identification step based on bidirectional best hits between each genomes protein sequences, and the seed groups. Specifically, we used command-line tools of the NCBI software development kit (version 6.1, obtained from ftp://ftp. ncbi.nlm.nih.gov/toolbox/ncbi_tools/ncbi.tar.gz) to build BLAST databases on individual genomes and the seed groups. First, the seed groups were used to query each genome. Hits between a seed sequence and the query were retained if (1) had an E-value at most 10 −9 , (2) had an amino acid identity at least 50%, and (3) had a score within 90% of the best BLAST hit with the same query. Subsequently, the retained sequences were used to query the seed groups to filter out ambiguous group assignments. Sequences passed the filtering step if they had the highest scoring hit to the same seed group used in the initial search, and hit other groups with at most 98% of the best score.

Ortholog set selection
In a second phase, the collected paralogs were submitted to our ortholog "weaving" protocol described before  which, by reconciling the gene phylogeny with the given species phylogeny, selected at most one sequence for each species that are highly likely to be orthologs. Finally, the resulting candidate ortholog sets were further filtered by verifying their phylogenetic relationships. In particular, we required a non-negative log-likelihood ratio between the neighborjoining tree and the known species phylogeny, computed by using PhyML (Guindon and Gascuel, 2003).

Input table of intron presence and absence
Following a well-established analysis pipeline (Rogozin et al., 2003(Rogozin et al., , 2005Csűrös et al., 2008) we selected homologous intron-bearing positions by projecting the exon-intron structure onto aligned orthologous coding sequences. The resulting intron dataset is a table of intron absence and presence, in which each column corresponds to a splice site projected onto an unambiguous alignment column (retaining splice phase information), and each row corresponds to one of the 99 terminal taxa. Table entries may be 1 (splice site is present), 0 (no splice site), or "ambiguous," denoting a missing ortholog or an uncertain alignment. The table contains all columns with at most 24 ambiguous entries, and was produced using the Malin software package (Csűrös, 2008) with its default settings for ambiguity. The alignments were done using Muscle (Edgar, 2004) with its default settings.

Priors and proposals
Recall that the model is defined by the parameters θ = α gain , α loss , π, t uv , uv : uv ∈ T .
The prior distribution was uniform for every parameter: over the range [0, 10] for α · and t uv , and over the range [0, 1] for π and uv . In a typical MCMC proposal, a subset of model parameters was chosen, and then multiplied by a random value. Random multipliers were generated by an exponential distribution restricted to some range [e −σ , e σ ]. More precisely, the pseudorandom multipliers were drawn as M σ = exp U −σ,σ , using a pseudorandom uniform U −σ,σ over the interval (−σ, σ). In an Edge move, for instance, the proposed parameter vector θ was produced by setting the length of a randomly picked to t uv = t uv · M σ , and keeping all the other parameter values unchanged. The following table shows the proposals used during MCMC sampling. For Edge and Rates moves a random edge was selected uniformly. For a Resize move, a random non-terminal node was selected uniformly, and all edge lengths were multiplied with the same random M σ . For a StarRates move, a random non-terminal node was selected uniformly, and the ratios on each incident edge (children, and the parent -except for the root) were multiplied with a separately drawn random M σ .
After the burn-in period, the multiplier range was reduced by setting σ ← σ/2 for each move type. The Var move was applied only to rate variations with more than 1 discrete categories. The frequency denominator m = 82+v depends on the types of permitted Var moves: m = 82 for no rate variation across sites, m = 83 if loss or gain, but not both vary across sites, and m = 84 if both vary across sites.

Sampling
Sampling data was collected after a burn-in period of 500 thousand steps, subsampling every 1000th subsequent position, across 100 independently launched chains. Figure 1 illustrates the convergence of the log-likelihoods to the same value across all the chains. Figure 2 illustrates the mixing of the sampled reconstructions at LECA. In Figure 3, a mixing efficiency is calculated, determined by across-and within-chain variations. Sampled reconstructions at a given step are independent across chains, and therefore can be used to estimate the standard deviation for the estimated posterior distrbution. We define the mixing coeffecient as the ratio between the estimated standard deviation within a chain, and the standard deviaion of the estimated quantity. With sufficiently sparse subsampling, the coefficient approaches 1. Figure 3 illustrates the superior mixing at most ancestral nodes.    Mixing at ancestral estimates. The standard deviation (SD) was estimated within chains, and across chains. For within-chain variation, the SD was calculated from the 500 post-burnin sampled ancestral intron counts, and then averaged across the chains. For across-chain variation, the SD was calculated from the 100 chains at each MCMC step, and then averaged across the steps.

Mixing efficiency for ancestral reconstruction
3 Ambiguity at Amoeoboza and Chromalveolates Figure 4 illustrates that the ambiguity at the amobozoan ancestor is clearly caused by the impossibility of inferring the timing of intron losses. Figure  floatnbr5 illustrates the more complicated dependencies between chromalveolate ancestors. The tree below facilitates the interpretation by showing the assumed phylogenetic relationships between protists, and the naming of ancestral nodes. Table 6 shows the lack of signal in chromalveolate algae, and the strong intron conservation seen in human introns.  Figure 4 Sampled ancestral reconstructions at the ancestor of Amoebozoa in an MCMC trace. The plots show introns lost on the branches (":l") to the inferred number of introns in the Amoebozoan ancestor ("Amoebozoa:p"). The coloring of the dots indicate when the state was visited.

Figure 5
Sampled ancestral reconstructions at the ancestors of chromalveolate algae an MCMC trace. The plots show introns lost on the branches (":l") or inferred to be present (":p") at various positions. The coloring of the dots indicate when the state was visited.

Table 6
Introns in shared positions between some nodes in the data set. The first data row in each subtable gives the reference taxon and the number of sites in which the site has an intron in the reference group's members. The subsequent rows list the intron-sharing patterns between the reference and other taxa t. "Shared" sites have introns in both taxa; "reference-specific" sites have introns in the reference taxon, opposing absent and ambiguous entries in t. "Taxon-specific" sites have introns in t but not in the reference taxon. Notice that the counts include only unambiguosly aligned sites, and, thus, they may not sum up to as expected. For instance, only 654 of 780 (84%) human introns can be unambiguously projected onto homologous amphioxus sequence, and 537 of those amphioxus positions correspond to a splice site. For higherlevel taxa, "sharing" was established by at least one terminal descendant taxon: e.g., 568 human introns are projected to an annotated splice site in at least one arthropod genome. Percentages in parentheses are calculated with respect to all introns for the taxa in question: either the percentage of introns in the reference ("15% of all human introns do not align with amphioxus introns), or those in t ("20% of amphioxus introns do not align with human introns).

Ancestral phase distributions
Individual splice site histories were reconstructed using the maximum-likelihood model found during sampling. Figure 7 and Table 8 show the reconstructed phases at sites represented in terminal taxa.

Figure 7
Intron phase distribution at different nodes. Branch widths are proportional to (inferred) intron density. Pie charts show the relative frequencies of phase-1, -2 and -0 splice sites. Notice that the reconstruction covers only the splice sites in the data set (where phase can be established), and not the extrapolated allabsent profiles.

Model fit
Model fit was assessed by computing the the harmonic average of the likelihood in a trace, after the burn-in period, denoted as H. In the limit, the statistic H is proportional to the so-called marginal likelihood, and the ratio of H between two models may serve as a proxy for the Bayes factor (Newton and Raftery, 1994). The distribution of the statistic was computed from the 100 MCMC traces, in order to interpret the significance of differences seen between H of different models.
In one set of experiments, we selected the rate variation model with a trifurcating LECA. As Figure 9 shows, multiple gain categories do not improve the models dramatically. Multiple loss categories give significant improvements, but the returns are diminishing beyond 4 categories. The most likely root resolution (CU,P) is the bifurcation between plants and the other two supergroups, but the average likelihoods do not differ more than between the trifurcating traces.

Figure 9
Model fit for different models and phylogeny rootings. The X axis plots the natural logarithm of H for different models, computed from 500 subsampled MCMC steps. Rate variation models are denoted by LxGy for x loss and y gain categories. The shaded area shows the empirical cumulative distribution function (cdf) of H in 100 traces with the L4G1 model chosen for the ancestral inference. Alternative model scores are projected onto the cdf for establishing significance levels: for instance, L5G1 has a score H surpassing 64 of 100 L4G1 traces. Different LECA rooting models (tri is the rooting chosen for the inference) are denoted as shown below.
6 Ancestral inference Figure  S10.i-xcvii Ancestral density and confidence intervals. The plots show the posterior distribution of the ancestral intron density inferred from the sampling chains. On each plot, the horizontal red line shows the median (by the dot) and the 95% (±47.5%) credible interval around it estimated from 50000 subsampled MCMC steps. The Dollo parsimony and maximum-likelihood (ML) reconstructions are also shown.

Figure 11
Number of ancestral introns in different reconstructions. The plots compare the number of ancestral introns inferred by MCMC sampling (median in 50000 subsampled steps) with the inference by Dollo parsimony, and by the maximum-likelihood (ML) model found during sampling.

Figure 12
Lineage-specific loss and gain rates (μ uv andλ uv ). On both plots, the X axis shows the median rate seen in the 50000 MCMC-sampled steps. The Y axis plots the rate in the maximum-likelihood model. The two models do not differ much in their parameters.

Simulations
The consistency of different inference methods was assessed by generating 100 random simulated data sets using the median model inferred from the MCMC sampling (same sampling settings as before) on the main data set ("Same model").
In particular, we generated 7200 random intron absence-presence profiles, and performed ancestral reconstruction using Dollo parsimony, 100 independently run MCMC samples, and the maximum likelihood model found using sampling. The reconstrauction was compared to the known ancestral intron counts. In order to assess the methods' robustness, we performed 100 additional simulations using a "Mixed model,' in which model parameters were multiplied by randomly drawn random variables. At each intron site, the root presence probability (π), as well as the site-specific rate multipliers (γ j , ν j ) were multiplied by random lognormal variables ln N (0, σ 2 ) with σ = 0.2 for π, σ = 0.1 for the gain rate multiplier ν j , and σ = 0.3 for the loss rate multiplier γ j . Concomitantly, the edge length parameters were also multiplied independently (among themselves, and across sites) by exponentially distributed random variables with mean 1.
In a final set of experiments, we assessed inference robustness with respect to missing orthologs. Using 100 simulated sets, we replaced randomly chosen entries with ambiguity. Namely, we split the sites into blocks of 34 (average number of sites per gene in the main data set), and erased presence-absence information independently within each block by 13% probability.
We chose the number of simulated intron-bearing sites in each set of experiments to yield similar number of informative entries as the main data set (containing 14% ambiguous entries), generating fewer sites without ambiguity (7200 sites for "Same model" and "Mixed model"), and to 9000 sites with "Missing orthologs" (since some sites without unambiguous intron presence are filtered out in the inference).

Figure 13
Estimation error in simulated data at some key nodes. Each point on the plots shows the actual number of ancestral introns along the X axis, and the estimation error (actual number subtracted from inferred number) along the Y axis. The 25% error level is shown by the slanted dotted lines. The MCMC and ML procedures have very small empirical bias, but Dollo parsimony is typically completely off target.

Naegleria gruberi
We identified 68 homolog groups represented in the Naegleria gruberi genome (Fritz-Laylin et al., 2010), using bi-directional best BLAST hits in the same way as for the main data set ( §1.2). For groups with multiple Naegleria paralogs, we selected the one with the highest BLAST score. The 68 selected Naegleria genes were aligned to the multiple alignment profile constructed for their respective homolog groups (without recomputing the alignment between the 99 eukaryotes of the original data set). The alignments scored aligned intron positions alongside protein sequences, as described before (Csűrös et al., 2007). The intron presenceintron table was computed from the alignments in the same way as for the main data set ( §1.4). The resulting table comprises 4812 intron sites with 141 ambiguous Naegleria entries. The data set contains 97 aligned Naegleria introns, from which 57 fall into a homologous position with at least one other eukaryote. Twenty-three of those are inferred to have been absent at the trifurcating LECA (average posterior presence probability 9%), and 34 are inferred to have been present (average probability 95%). At the chromalveolate and unikont ancestors, the numbers are the same. The shared Naegleria intron counts do not differ significantly across deepbranching unikonts (Capsaspora, Dictyostelium, Allomyces), either. Two introns are inferred to have been inserted in parallel between Ngru and the Metazoan ancestor. At the common ancestor of plants and green algae, only 26 Naegleria introns are inferred to have been present.

Table 14
Naegleria introns projected onto the data set. The table lists the number of introns shared between Naegleria and other eukaryotic taxa. The table rows list the intron-sharing patterns between the Ngru and other taxa t ("Taxon" column). "Shared" sites have introns in both taxa, "Ngru-specific" sites have introns in Naegleria, opposing absent and ambiguous entries in t. "taxonspecific" sites have introns in t but not in Ngru. Notice that the counts include only unambiguosly aligned sites, and, thus, they may not sum up to as expected. For instance, only 88 of the 97 Ngru introns can be unambiguously projected onto homologous Dictyostelium sequences, and 25 of those correspond to a splice site in at least one Dictyostelium genome, the remaining 63 are "Ngru-specific."