MY conceived and designed the experiments, contributed to and coordinated the analyses of the data, contributed analysis tools, and wrote the paper. CJM, CS, SP, JK, and GH contributed to the analyses of the data. CJM, CS, SP, JK, GH, and SL contributed analysis tools. GMR contributed to experimental design and writing the manuscript.
¤a Current address: Department of Human Genetics, Eccles Institute of Human Genetics, University of Utah, Salt Lake City, Utah, United States of America
¤b Current address: U.S. Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
¤c Current address: Department of Bioinformatics, Genentech, South San Francisco, California, United States of America
The authors have declared that no competing interests exist.
We have used the annotations of six animal genomes
Just as protein sequences change over time, so do gene structures. Over comparatively short evolutionary timescales, introns lengthen and shorten; and over longer timescales the number and positions of introns in homologous genes can change. These facts suggest that the intron–exon structures of genes may provide a source of evolutionary information. The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein evolution. If, for example, intron–exon structures are strongly influenced by selection at the amino acid level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Using 11 animal genomes, Yandell et al. show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. This means that gene structures provide a source of information about the evolutionary past independent of protein sequence similarities—a finding the authors employ to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.
Sequence alignment and comparison have revealed much about evolution at the nucleotide and amino acid level, but much less is known about the structural evolution of genes—how their intron–exon structures, intron lengths, alternative splicing, and untranslated regions change over time. Genome annotations comprise an invaluable resource for answering such questions because they describe the essential parts of a gene and their relationships to one another [
Although the origins and mobility of introns are still subjects of debate, previous studies [
The utility of gene structures as materials for phylogenetic analyses, however, depends upon their independence from the forces driving protein sequence evolution. If, for example, intron–exon structures are strongly influenced by selection at the protein level, then using them for phylogenetic investigations is largely pointless, as the same information could have been more easily gained from protein analyses. Also needed is a better understanding of the rates at which different aspects of gene structures evolve. Clearly, more slowly evolving aspects of gene-structure—intron positions [
In order to address these issues, we have characterized the number, position, and length of introns and exons in 11 individual genomes representing four phyla. These data provide a panoramic perspective from which to investigate the evolution of gene structures on a variety of timescales—from the less than 5 million years since the divergence of
We show that evolution of intron lengths and positions is largely—though not completely—independent of protein sequence evolution. Thus, gene structures provide a source of information about the evolutionary past independent of protein sequence similarities. We use this fact to investigate the accuracy of the protein clock and to explore the utility of gene structures as a means to resolve deep phylogenetic relationships within the animals.
In order to facilitate the use of genome annotations as substrates for computational analyses, we developed an open-source software library (CGL) for comparative genomics using genome annotations. The software and a tutorial on its use are available at
CGL can convert the annotations from many different databases into a single standardized format; thus the software can be used to assemble very large repositories of annotations that encompass the contents of multiple genome databases. For purposes of the analyses presented here, we have used CGL to convert the genome annotations
The Bilaterian animals are generally classified as either protostomes or deuterostomes. In deuterostomes, the blastopore lip becomes the anus, whereas in the protostomes it becomes an anterior oral structure. The two lineages are believed to have last shared a common ancestor more than 500 million years ago, and the nematodes may have diverged from both lineages even earlier [
In order to survey gene evolution during even shorter time intervals, we also included in our dataset five recently sequenced but unannotated genomes:
As our collection of annotated genomes contained more than 100,000 annotations, we sought first to survey and summarize the contents of each genome's annotations with regard to gene structure. We chose three basic measures: intron length, exon length, and intron density. These measures provide a concise summary of the similarities and differences in intron–exon structure for the six annotated genomes. Placing these data in their phylogenetic context allows trends in the evolution of gene structure to emerge.
(A) Intron length. Annotated intron length (log10) is plotted on the
(B) Exon length.
(C) Intron density. A transcript's intron density is equal to its number of coding introns divided by the length of the protein it encodes.
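The definition of intron density given for panel C can be written as a short sketch. The function and parameter names here are illustrative assumptions and are not taken from CGL; the genome-wide mean is just one possible way to summarize per-transcript values.

```python
from statistics import mean


def intron_density(num_coding_introns: int, protein_length: int) -> float:
    """Coding introns per amino acid for one annotated transcript."""
    if protein_length <= 0:
        raise ValueError("protein_length must be positive")
    if num_coding_introns < 0:
        raise ValueError("num_coding_introns must be non-negative")
    return num_coding_introns / protein_length


def mean_intron_density(transcripts) -> float:
    """Average intron density over (num_coding_introns, protein_length) pairs.

    A hypothetical genome-wide summary; the source does not specify
    which summary statistic it applies across transcripts.
    """
    return mean(intron_density(n, length) for n, length in transcripts)


if __name__ == "__main__":
    # Made-up transcripts for illustration only: (coding introns, protein length).
    example = [(4, 400), (0, 250), (9, 300)]
    print(mean_intron_density(example))
```

Dividing by protein length rather than transcript length keeps the measure comparable between genes whose untranslated regions differ greatly in size.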
We also characterized each genome with respect to coding-exon length (
One process that might explain the longer exons characteristic of the protostome genomes is retro-transposition-mediated gene duplication [
There were 13,339
In order to further investigate the distribution of introns, we have made use of a simple summary statistic of gene structure: intron density, or the number of coding introns associated with a particular protein divided by that protein's length [
While intron density is an attribute of a single annotated transcript, when applied to entire annotated genomes it can also be used to provide a summary statistic regarding the distribution of introns within a genome. Consistent with the exon-length distributions shown in
To explore these data more closely, we also examined the frequency distributions of intron density in each of the six annotated genomes (
The data in
No matter what the ancestral animal distribution may have looked like, the diversity of the present-day intron density distributions makes it certain that extensive remodeling of intron–exon structures has occurred in at least some of these genomes since the six animals last shared a common ancestor. Several lines of evidence suggest that this process has been a slow one. Current estimates of the rate of intron insertion and deletion in animal genomes have placed it at less than one event/gene/200 million years [
Next, we sought to characterize and compare the six annotated proteomes to one another with respect to protein similarities. These analyses are a necessary prerequisite for an examination of intron–exon structures in the context of protein similarities. As a first step, we performed an all-against-all BLASTP [
In order to assay the impact of unequal rates of protein evolution on these data, we also compared the six animal proteomes to the
For purposes of further analysis, we recast the distributions shown in
Proteome-wide trends in protein similarity (A), and genome-wide trends in intron–exon structural similarity (B). Numbers beneath tree nodes are bootstrap values.
This approach to consensus phylogenetic tree construction differs from standard methods [
Our characterization of proteome-wide patterns of amino acid similarities (summarized in
Cursory examination of these HSPs makes clear two important facts. First, genome-wide trends in intron–exon structural similarities roughly parallel those of phylogeny and protein similarity. For example, 92% of human–mouse, 36% of human–
In order to address differences in intron density, we formulated a more exacting, though less intuitive, definition of intron–exon structural similarity that takes intron density into account. To do so, we calculated a log odds ratio (LOD) score for each set of concatenated reciprocal best-hit best HSPs in toto, wherein the ratio of the observed number of aligned splice junctions to the expected frequency was used as a measure of global similarities in intron positions for two genomes. To obtain the expected frequency of aligned introns, we multiplied the frequencies of introns within query and subject portions of the concatenated alignment. Thus this measure of intron–exon similarity controls for the differing frequencies of introns in the different genomes. It is also essentially identical to the standard LOD score approach used to measure protein similarities [
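One way to read the LOD calculation described above is sketched below. It rests on assumptions the text does not state: intron frequencies are taken per aligned position of the concatenated alignment, and the logarithm is base 2. All names are hypothetical.

```python
import math


def intron_lod(aligned_introns: int,
               query_introns: int,
               subject_introns: int,
               alignment_length: int,
               base: float = 2.0) -> float:
    """Log odds of observed versus expected aligned splice junctions.

    Expected frequency = f_query * f_subject, where each f is the number
    of introns in that genome's portion of the concatenated alignment
    divided by the alignment length (an assumed normalization).
    """
    if alignment_length <= 0:
        raise ValueError("alignment_length must be positive")
    if query_introns <= 0 or subject_introns <= 0:
        raise ValueError("both genomes need at least one intron")
    observed = aligned_introns / alignment_length
    expected = (query_introns / alignment_length) * (subject_introns / alignment_length)
    if observed == 0:
        return float("-inf")  # no shared intron positions at all
    return math.log(observed / expected) / math.log(base)


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(intron_lod(aligned_introns=50, query_introns=100,
                     subject_introns=100, alignment_length=10_000))
```

Because the expected term scales with the intron frequencies of both genomes, an intron-poor genome is not penalized simply for having fewer introns available to align.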
To summarize the results of this analysis, we recast the resulting matrix of LOD scores into the phylogenetic tree shown in
As was the case for protein similarities (
One issue not addressed by our previous analyses is the extent to which evolution of intron–exon structures is coupled to that of protein sequences. A clear understanding of the impact of protein-sequence evolution on gene structures is desirable if gene structures are to be used for phylogenetic investigations.
Although phylogeny is the primary factor structuring the data in
The finding that the rate of change in a gene's intron–exon structure is influenced by selection on the protein it encodes (
(A) Unrooted neighbor-joining tree based upon amino acid similarities for reciprocal best-hit best HSPs having 1.25 bits/aligned amino acid pair.
(B) Unrooted neighbor-joining tree based upon similarities in the intron–exon structures of those same HSPs.
Note that the tree in 5B suggests the same phylogenetic relationships as the tree shown in
Having examined the evolution of intron–exon structures, we next sought to investigate the evolution of intron lengths. Previous work [
As
(A) Quartet orthologous intron pairs.
(B) Paralogous introns.
If this interpretation is correct, then the data in
The data in
To further investigate these questions, we turned to the six
(A)
(B) Intron lengths of paralogs having the same intron–exon structure as judged by the positions of their splice junctions relative to the protein alignments of their reciprocal best-hit best HSPs.
As can be seen, the lengths of the inferred
The two distributions of orthologous intron pairs shown in
In general, the transposon load of the vertebrate introns is higher than that of the insects, and much of the central bulge is due to the presence of additional LINE elements in either the human or mouse member of the pair (unpublished data). This is in sharp contrast to the two insects. Although some of the larger off-diagonal intron pairs in the insect distribution (
Although transposons seem to explain the central bulge in the human–mouse distribution, they do not explain the details of the
The preceding observations imply that the rate at which intron pairs leave the diagonal in
No doubt, other less easily measured factors also affect the rate at which intron lengths evolve within a species. If transposon load and/or rates of transposition, for example, vary greatly within two genomes, correlations in the lengths of orthologous introns will be a poor indicator of time since last common ancestor. Rather than attempt to measure the impact of a host of factors on intron lengths, we chose instead to ask a related question. Namely, how constant is the decline in the correlation in orthologous intron lengths over time? Doing so allowed us to directly assess not only whether hypothetical differences in transposition rates actually do act to modify the rates at which length correlations among homologous
From left to right, and top to bottom: Annotated
Bottom right-hand panel: Annotated
Correlations in orthologous intron lengths seem to accord well with the passage of time (
The strong correlation between the intron and protein clocks demonstrated in
Our investigations of intron length evolution focused on discovering the forces driving changes in intron lengths, the rate at which they change, whether that rate is constant, and, if so, over what duration and phylogenetic scope. Intron lengths vary greatly among the six annotated genomes, yet when placed in their phylogenetic context general trends emerge. Every deuterostome genome in our collection is characterized by a predominance of class-II (>100 nt) introns, whereas class-I (<100 nt) introns predominate in the protostome genomes. The similarity in the human and mouse distributions suggests that these distributions change slowly over periods of tens of millions of years. Our examinations of intron lengths within the
In order to further investigate the evolution of intron lengths, we used a transitive reciprocal best-hit strategy to assemble a dataset of genes we term quartets. Each quartet consists of four genes: a pair of human paralogs and their mouse orthologs. In theory, the orthologous members of each quartet share a more recent common ancestor than do the paralogous members of the quartet. The strong correlation in intron lengths characteristic of orthologous quartet members demonstrates that intron lengths within the vertebrates remain correlated for tens of millions of years following speciation events. Our comparisons of orthologous and paralogous intron lengths in the
To measure the rate at which intron lengths change, we examined them in the context of the protein clock. Our results show that correlations in the lengths of orthologous introns have declined at a constant rate within the
The intron and protein clocks complement one another in a number of ways. Rates of change among protein sequences are reasonably constant for any given set of orthologous genes across phyla but vary widely among different gene families. On the other hand, our results show that the speed of the intron clock may vary between phyla, but not between gene families within a genus. These facts mean that the intron clock is well suited for investigating the evolutionary history of gene families. To see why, consider that a collection of genes all having the same intron–exon structures and intron lengths are likely the result of recent duplication events, regardless of whether they encode rapidly or slowly evolving proteins.
Our analyses of gene structures demonstrate that change in intron–exon structures is subject to greater lineage-specific variation than is protein sequence evolution. The jagged right-hand side of
Despite the variability in their rate of evolution, the fact that genome-wide trends in intron–exon structures support the same phylogeny as proteome-wide trends in protein sequence similarities (
The large numbers of introns and low rate of intron insertion and deletion characteristic of animal genomes make it likely that intron density distributions are among the more slowly evolving traits of any animal genome. Consistent with this hypothesis, the
CGL can be downloaded from
The human, mouse, and
Reciprocal best-hit best HSPs were recovered from proteome-versus-proteome BLASTP searches using WU-BLAST [
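The reciprocal best-hit criterion can be illustrated with a minimal sketch. It assumes each search has already been reduced to hypothetical (query, subject, bit score) tuples for the best HSP of each hit; parsing of WU-BLAST output, score thresholds, and tie handling are outside its scope and are not taken from the source.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    On ties, the first hit encountered is kept (an arbitrary choice).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs in which each protein is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up identifiers and scores for illustration only.
    human_vs_mouse = [("h1", "m1", 200.0), ("h1", "m2", 150.0), ("h2", "m2", 90.0)]
    mouse_vs_human = [("m1", "h1", 198.0), ("m2", "h1", 140.0), ("m2", "h2", 80.0)]
    print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In this toy input, h2 selects m2 as its best hit but m2 prefers h1, so only the h1 and m1 pair is retained.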
The tree shown in
To produce
To extract orthologous introns from unannotated genomes, each annotated
All intron lengths are inferred, with the exception of
Pair-wise Spearman correlation coefficients were used as a similarity measure. Bootstraps were produced by randomly resampling intron pairs with replacement. All intron lengths are inferred, with the exception of
Orange line, best-fitting linear regression (y = 0.0457x + 0.0863; R² = 0.0015). No significant Spearman correlation coefficient was observed for these data.
The authors would like to thank S. Mount, G. Marth, I. Korf, D. Shook, G. Miklos, and J. Stajich for providing constructive criticism of a draft of this manuscript; S. Shu and K. Eilbeck for database assistance; and W. Pearson for helpful suggestions regarding how to summarize large amounts of protein similarity data.
HSP, high-scoring segment pair
LOD, log odds ratio
nt, nucleotides