Transcript diversity reflects deleterious RNA processing errors shaped by population size in metazoans

  • Kai Mi,

    Roles Formal analysis, Investigation, Methodology, Writing – original draft

    Affiliation Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China

  • Lili Guan,

    Roles Resources

    Affiliations Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China, Department of Otorhinolaryngology Head and Neck Surgery, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China

  • Bandhan Sarker,

    Roles Methodology

    Affiliation Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China

  • Siliang Song,

    Roles Methodology

    Affiliation Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America

  • Tianjiao Zhou,

    Roles Resources

    Affiliation Department of Otorhinolaryngology Head and Neck Surgery, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China

  • Hongliang Yi,

    Roles Resources

    Affiliation Department of Otorhinolaryngology Head and Neck Surgery, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China

  • Jianzhi Zhang,

    Roles Conceptualization, Funding acquisition, Writing – review & editing

    jianzhi@umich.edu (JZ); chuanx@sjtu.edu.cn (CX)

    Affiliation Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, Michigan, United States of America

  • Chuan Xu

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Writing – original draft, Writing – review & editing

    jianzhi@umich.edu (JZ); chuanx@sjtu.edu.cn (CX)

    Affiliations Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai, China, Department of Otorhinolaryngology Head and Neck Surgery, Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China, Key Laboratory of Biodiversity and Environment on the Qinghai-Tibet Plateau, Ministry of Education, Lhasa, China

Abstract

In eukaryotes, alternative transcription initiation (ATI), alternative splicing (AS), and alternative polyadenylation (APA) result in multiple different transcripts per gene, but the biological significance of the transcript diversity produced remains controversial. Some suggested that this diversity is adaptive, while others contended that it is largely deleterious and arises from molecular errors in transcription and RNA processing. The error hypothesis makes a distinct prediction that is not expected under the adaptive hypothesis: transcript diversity declines with the effective population size (Ne) of the species because natural selection minimizing errors is more effective under larger Ne. By analyzing 166 transcriptomes from 75 metazoans, we report that transcript diversity measured by the percentage uses of minor ATI, AS, and APA sites decreases with Ne or its proxies. This observation supports the error hypothesis and suggests that metazoan transcript diversity is largely deleterious.

Introduction

In eukaryotes, alternative transcription initiation (ATI) [1] and alternative polyadenylation (APA) [2] can respectively vary the beginning and end of a transcript produced from a gene, while alternative splicing (AS) can generate different RNA isoforms by selective inclusion or exclusion of exons in mRNA processing [3]. As a result, multiple different RNA transcripts (isoforms) are often produced from a single eukaryotic gene [4,5], generating transcript diversity. These transcripts may vary in their coding sequence, untranslated regions, and/or other regulatory elements [1–3]. ATI, APA, and AS are common phenomena in various eukaryotes such as fungi [6–8], plants [9–11], and animals [12–14]. For example, in humans, >70% of genes exhibit APA [13], >50% of genes display ATI [15], and >95% of multi-exon genes show AS [16], resulting in >170,000 transcripts recorded for ~20,000 human protein-coding genes (ENSEMBL genome reference consortium human build 38; GRCh38). ATI, APA, and AS may vary among tissues [17–19], across developmental stages [19–21], and during cell differentiation [22–24], and can contribute to disease [1,25,26].

Despite the universality of ATI, APA, and AS in eukaryotes, the biological significance of the created transcript diversity is debated. Some case studies suggested that different RNA isoforms are functionally distinct. For instance, the human Lef1 gene [27], mouse Ighm gene [28], and Drosophila Sxl, Tra, and Dsx genes [29] have functionally distinct RNA isoforms (and corresponding protein isoforms) produced by ATI, APA, and AS, respectively. Such examples led to the hypothesis that transcript/proteome diversity is generally adaptive and that ATI, APA, and AS are widely used, regulated mechanisms to expand transcript/proteome diversity [3,4,30,31]. However, a competing hypothesis known as the error hypothesis has also been suggested [32–38]. The error hypothesis contends that transcription and RNA processing are error-prone; consequently, the vast majority of the observed transcript diversity reflects molecular errors that not only lower the number of functional molecules and waste energy but may also create cytotoxicity [39]. Several lines of evidence support the error hypothesis [32–36,39,40]. For example, the error hypothesis predicts that transcript diversity is lower in relatively highly expressed genes than in relatively lowly expressed genes because selection minimizing error rates acts more strongly on highly expressed genes than on lowly expressed genes [39]. Empirical data indeed support this prediction [39].

Nonetheless, the level of transcript diversity has not been extensively compared across species. Such a comparison is useful for differentiating between the adaptive and error hypotheses, because the two hypotheses make distinct predictions about the relationship between the level of transcript diversity in a species and the effective population size (Ne) of the species. Specifically, under the error hypothesis, transcript diversity is due to deleterious molecular error, so is disfavored and lowered by natural selection. Because the efficacy of natural selection increases with Ne, we expect transcript diversity to decline with Ne [41,42]. Under the adaptive hypothesis, however, transcript diversity is beneficial so may be selectively elevated. Under this scenario, transcript diversity is expected to increase with Ne. However, one could also argue that, under the adaptive hypothesis, the optimal level of transcript diversity in a species depends on the specific condition and environment of the species; as a result, no prediction can be made regarding the relationship between the level of transcript diversity and Ne. At any rate, a negative correlation between Ne and transcript diversity is predicted by the error hypothesis but is not expected under the adaptive hypothesis. Indeed, this prediction was validated by a comparative analysis of AS across 53 species [36].

In the present work, by quantifying ATI, APA, and AS in 166 transcriptomes, we compared levels of transcript diversity among 75 metazoan species spanning a wide range of Ne. We found that transcript diversity generally decreases with Ne or its proxies, strengthening the previous support of the error hypothesis for AS [36] and providing new evidence for this hypothesis for ATI and APA.

Results

Genomic and transcriptomic data

Based on two recent studies [36,43], we assembled a list of 100 diverse metazoan species with available genome sequences (Fig 1A). We collected publicly available genome, genomic annotation, and transcriptome data from these 100 species (Fig 1B and S1 Data). The transcriptomic data are the basis of transcript diversity estimation and they comprise Cap Analysis of Gene Expression Sequencing (CAGE-seq) data from seven species for direct measurement of ATI, 3′-end-seq data from 10 species for direct measurement of APA, and RNA-seq data from 75 species for detection of AS and prediction of ATI and APA (Fig 1B and S1 Data).

Fig 1. Species phylogeny, genomic and transcriptomic data, and Ne proxies used in the present study.

(A) Phylogenetic tree of the 100 metazoan species considered. (B) Available genome and transcriptome datasets of each species concerned, including CAGE-seq, 3′-end-seq, RNA-seq, coding genes, and BUSCO genes. (C) Ne and proxies. Spearman’s correlation and Phylogenetic Generalized Least Squares (PGLS) regression between Ne and life span (D), body length (E), and the nonsynonymous to synonymous substitution rate ratio ω computed using all BUSCO genes (F) and all one-to-one orthologous genes (G) across species. Each dot represents a species, colored according to its clade in (A). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g001

Because transcript diversity estimated from different tissues may not be comparable across species, we focused on three tissues (brain, ovary, and testis) best represented in the transcriptomic data, respectively covering 67, 51, and 48 species (Fig 1B and S1 Data). In interspecific comparisons, we initially analyzed all protein-coding genes in each species (Fig 1 and S1 Data). However, gene set variations among species could introduce a confounder in our comparison. We therefore focused on Benchmarking Universal Single-Copy Orthologs (BUSCO) genes in an additional comparison (Fig 1B and S1 Data), as was done in previous studies [36,43].

Ne and proxies

Of the 100 species, 26 have published Ne (Fig 1C and S2 Data). Given that most species lack published Ne, we resorted to three other parameters as Ne proxies: the body length, life span, and nonsynonymous to synonymous substitution rate ratio (ω). Because larger animals and longer-lived animals tend to have smaller Ne, body length and life span have been used as Ne proxies [36,44,45]. Under the nearly neutral theory and the neutral assumption of synonymous mutations, ω is expected to decline with Ne [41] so can also be a proxy for Ne. However, when synonymous mutations are frequently non-neutral as has been documented in some species [46], the validity of the above expectation is uncertain; we therefore examined it empirically (see below). We collected body length and life span data from previous studies [36,47] and estimated ω for each species (see Materials and methods and S3 Data). We correlated the three proxies with Ne and confirmed their negative correlations (Fig 1D–1G), as reported previously [36]. Therefore, the error hypothesis predicts that transcript diversity should decline with Ne or increase with the three Ne proxies considered here. Note that throughout this study, we used two types of correlation analysis. The first is the simple rank correlation, whereas the second is Phylogenetic Generalized Least Squares (PGLS) regression, which controls for the phylogenetic relationships of the species considered in the data (see Materials and methods).
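The simple rank correlation mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration of Spearman's ρ (Pearson correlation of average ranks, so ties are handled); it does not implement PGLS, and the function names and the toy Ne and life span values are hypothetical rather than taken from the study's data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for five species (illustration only)
toy_ne = [1e4, 5e4, 2e5, 1e6, 3e6]
toy_life_span = [70, 40, 12, 3, 5]
rho = spearman_rho(toy_ne, toy_life_span)
print(round(rho, 3))
```

A negative ρ in such a comparison is what is expected if the proxy declines as Ne grows; a phylogenetically controlled analysis would additionally need the species tree.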

Interspecific variation in transcript diversity caused by APA

Accurate detection of APA typically relies on 3′-end RNA sequencing (i.e., 3′-end-seq), which can identify precise APA sites and their relative usages [48]. Several library preparation methods are available for 3′-end-seq [49], such as 3′READS [50], 3P-seq [51], and PAS-seq [52]. However, only a limited number of species have 3′-end-seq data [53]. We collected 3′-end-seq datasets from 10 species with Ne (S1 Data). For a given gene, let us refer to the most frequently used APA site as its major APA site, which is likely to be functionally the best APA site, and all other APA sites as its minor APA sites. We measured the transcript diversity of the gene caused by APA by the total percentage usage of its minor APA sites, which equals the total 3′-end reads of its minor APA sites divided by the total APA reads of the gene. We then averaged the transcript diversity across all genes considered in a species to represent the overall transcript diversity due to APA for the species. We found a negative correlation between transcript diversity and Ne across species, regardless of whether we considered all protein-coding genes (S1A Fig) or only BUSCO genes (Fig 2A). However, the negative correlations (or those from PGLS) were not significant, potentially due to the limited number of species in the analyses.
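The per-gene and per-species measures defined above can be sketched as follows. This is a minimal illustration, assuming each gene is represented by a list of read counts, one per APA site; the function names and the toy counts are hypothetical.

```python
from statistics import mean


def minor_site_usage(site_reads):
    """Fraction of a gene's reads on all sites other than the most-used (major) site."""
    total = sum(site_reads)
    if total == 0:
        raise ValueError("gene has no supporting reads")
    return (total - max(site_reads)) / total


def species_diversity(genes):
    """Mean minor-site usage across genes; `genes` maps gene ID -> per-site read counts."""
    return mean(minor_site_usage(reads) for reads in genes.values())


# Hypothetical read counts (illustration only)
toy_genes = {"geneA": [90, 10], "geneB": [50, 30, 20], "geneC": [40]}
for gene, reads in toy_genes.items():
    print(gene, minor_site_usage(reads))
print("species mean:", species_diversity(toy_genes))
```

A gene with a single detected site contributes a usage of zero, so how such genes are filtered affects the species mean.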

Fig 2. Transcript diversity caused by APA (measured by the mean total percentage usage of minor APA sites per gene) declines with Ne across species.

(A) Relationship between Ne and transcript diversity caused by APA estimated from 3′-end-seq. (B–E) Relationship between transcript diversity caused by APA estimated from RNA-seq and Ne (B), life span (C), body length (D), or ω (E) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (F) Correlation between transcript diversity caused by APA and Ne (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g002

To increase the number of species in the APA analysis, we predicted APA sites using RNA-seq data [48,54]. In particular, TAPAS leverages the Pruned Exact Linear Time algorithm, RNA-seq data, and gene structure information to predict APA sites and their abundances [55] and is known to outperform other tools [56]. We validated the performance of TAPAS by comparing the APA sites identified by 3′-end-seq with those predicted by TAPAS from RNA-seq using a dataset in which CD4 T cells were sequenced by both 3′-end-seq and RNA-seq [57]. We quantified APA site usage levels in both data types (see Materials and methods) and found a significant positive correlation between them (ρ > 0.27, P < 0.01; S1B Fig). We also computed the total percentage usage of minor APA sites for each gene in both data types and again observed a significant positive correlation between them (ρ > 0.16, P < 0.01; S1C Fig). To validate the applicability of APA site prediction by TAPAS in multiple species, we acquired RNA-seq datasets from the 10 species with 3′-end-seq data; while 3′-end-seq and RNA-seq were not generated from the same samples, they were from the same tissues in each of these species (S1 Data). We found that the average total percentage usage of minor APA sites per gene is significantly correlated between 3′-end-seq and RNA-seq data across species (ρ = 0.79, P = 9.8 × 10⁻³; S1D Fig). These results confirm the reliability of predicting APA sites from RNA-seq data.

We previously reported a negative correlation between the total percentage usage of minor APA sites of a gene and the gene expression level in each of five mammals studied, supporting the error hypothesis [32]. To assess whether this pattern extends beyond mammals, we repeated the analysis using APA sites predicted from RNA-seq data and found that the negative correlation persists in 163 of the 166 samples analyzed (S1E Fig), suggesting that this pattern is generally true across animals.

Next, we used RNA-seq data to estimate the average total percentage usage of minor APA sites per BUSCO gene in each of 75 species. We started with the brain because the number of species with RNA-seq data from this tissue is the highest. Across species, the above estimate of species transcript diversity decreases with Ne (ρ = −0.73, P = 4.3 × 10⁻⁴; P_PGLS = 1.8 × 10⁻²; Fig 2B), but increases with life span (ρ = 0.61, P = 9 × 10⁻⁵; P_PGLS = 4.2 × 10⁻²; Fig 2C), body length (ρ = 0.70, P = 2.4 × 10⁻⁶; P_PGLS = 4.4 × 10⁻⁷; Fig 2D), and ω (ρ = 0.63, P = 8.6 × 10⁻⁹; P_PGLS = 3.5 × 10⁻³; Fig 2E). Similar results were observed for the ovary and testis (Fig 2F). When the above transcript diversity in a species was calculated using all protein-coding genes, the patterns remained qualitatively unchanged (S1F Fig). Hence, patterns of interspecific variation in transcript diversity caused by APA support the error hypothesis.

Interspecific variation in transcript diversity caused by ATI

Although ATI is ideally assessed by CAGE-seq data [58], such data are available for only a few species in our collection (S1 Data). Because ATI can be inferred from RNA-seq data [59,60], we chose to use RNA-seq to compare ATI across species to allow the inclusion of a broader range of species in our analysis. We used SEASTAR, a method known to outperform other tools [60], to predict ATI sites. To validate the SEASTAR prediction, we collected a set of samples sequenced by both CAGE-seq and RNA-seq [61] and, respectively, identified ATI sites from CAGE-seq and from RNA-seq using SEASTAR for all protein-coding genes. Analogous to the APA analysis, we quantified the expression level for each ATI site, the total percentage usage of minor ATI sites for each gene, and the average total percentage usage of minor ATI sites per gene in a species (see Materials and methods), and found them to respectively exhibit a significant, positive correlation between estimates from CAGE-seq and those from RNA-seq (ρ > 0.32, P ≤ 0.02; S2A–S2C Fig). These findings support the reliability of predicting ATI site usage from RNA-seq data.

Our previous study in humans and mice showed that the total percentage usage of minor ATI sites of a gene declines with the gene expression level, supporting the error hypothesis of ATI [33]. In the three tissues of every species investigated in the present study, the above trend is observed (S2D Fig).

Next, we calculated the average total percentage usage of minor ATI sites per gene for the BUSCO genes of a species using brain RNA-seq data and correlated it with Ne across species. Indeed, a significant, negative correlation was observed across 19 species (ρ = −0.74, P = 2.7 × 10⁻⁴; P_PGLS = 6.8 × 10⁻³; Fig 3A). We similarly observed a positive correlation when Ne is replaced with life span (ρ = 0.66, P = 1.3 × 10⁻⁵; P_PGLS = 3.1 × 10⁻²; Fig 3B), body length (ρ = 0.78, P = 2.5 × 10⁻⁸; P_PGLS = 1.8 × 10⁻⁹; Fig 3C), or ω (ρ = 0.49, P = 2.9 × 10⁻⁵; P_PGLS = 1.7 × 10⁻⁷; Fig 3D). Similar results were obtained for the testis and ovary for BUSCO genes (Fig 3E) or all protein-coding genes (S2E Fig).

Fig 3. Transcript diversity caused by ATI (measured by the mean total percentage usage of minor ATI sites per gene from RNA-seq) declines with Ne across species.

(A–D) Relationship between transcript diversity caused by ATI and Ne (A), life span (B), body length (C), or ω (D) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (E) Correlation between transcript diversity caused by ATI and Ne (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g003

Interspecific variation in transcript diversity caused by AS

Previous studies have shown that AS is noisy [62] and mostly nonadaptive [34], consistent with the error hypothesis. To compare transcript diversity caused by AS across species, we assembled the transcripts for a gene and quantified the expression level of each transcript of the gene using StringTie, a widely used tool that outperforms others in the accuracy of both assembly and expression level measurement [63]. We then computed for each gene its transcript diversity due to AS by dividing the total splicing amount of all minor RNA splicing isoforms by the total splicing amount of all RNA splicing isoforms of the gene. Here, the splicing amount of an RNA splicing isoform is the total number of reads covering all splicing junctions of the isoform.
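The AS measure described above can be sketched as follows. This is a minimal illustration, assuming each isoform is represented by the read counts of its splicing junctions and, as an assumption not stated in the text, that the major isoform is the one with the largest splicing amount; the function names and toy counts are hypothetical.

```python
def splicing_amount(junction_reads):
    """Total number of reads covering all splicing junctions of one isoform."""
    return sum(junction_reads)


def as_diversity(isoforms):
    """Share of a gene's splicing amount contributed by minor isoforms.

    `isoforms` maps isoform ID -> list of junction read counts.
    Assumption: the major isoform is the one with the largest splicing amount.
    Returns None when the gene has no junction-spanning reads.
    """
    amounts = {name: splicing_amount(reads) for name, reads in isoforms.items()}
    total = sum(amounts.values())
    if total == 0:
        return None
    return (total - max(amounts.values())) / total


# Hypothetical junction read counts for one gene (illustration only)
toy_gene = {"iso1": [40, 35, 45], "iso2": [10, 5], "iso3": [15]}
print(as_diversity(toy_gene))
```

Because the amount sums over junctions, isoforms with more junctions carry more weight under this definition than under a per-transcript abundance measure.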

The error hypothesis predicts that the transcript diversity of a gene caused by AS should decrease with the expression level of the gene, because more highly expressed genes are subject to stronger selection against splicing error [64]. Indeed, we observed negative correlations in 158 of 166 samples (S3A Fig). These results confirm the previous findings from a limited number of species [34] and suggest that the error hypothesis of AS is broadly supported in animals.

Next, we calculated the mean transcript diversity (caused by AS) per gene for each species among BUSCO genes using data from the brain tissue. We found that this quantity significantly decreases with Ne (Fig 4A), but increases with life span (Fig 4B), body length (Fig 4C), and ω (Fig 4D). Similar results were observed for the ovary and testis (Fig 4E). These patterns remain qualitatively unchanged (S3B Fig) when all protein-coding genes were analyzed. Thus, consistent with a previous study [36], our across-species comparison of AS supports the error hypothesis.

Fig 4. Transcript diversity caused by AS (measured by the mean total percentage usage of splicing junctions in minor splicing isoforms per gene) declines with Ne across species.

(A–D) Relationship between transcript diversity caused by AS and Ne (A), life span (B), body length (C), or ω (D) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (E) Correlation between transcript diversity caused by AS and Ne (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g004

Discussion

Transcript diversity primarily arises from ATI, APA, and AS. Although past studies have provided substantial genomic evidence for the error hypothesis of ATI [33], APA [32,65], and AS [34], these studies focused on a small number of species. As a result of this limitation and an increasing number of reports of cases of functional ATI, APA, or AS, the general biological significance of transcript diversity remains controversial. In the present study, we expanded the analysis to 75 species and showed that the previous findings from a small number of species generally hold across animals. More importantly, we found that the transcript diversity of a species declines with the species’ Ne or its proxies, as predicted by the error hypothesis.

A central theoretical underpinning of the non-adaptive paradigm relevant to our results is the drift-barrier model, which was first proposed by Lynch [66] to explain the mutation rate variation across species. This model predicts that the efficacy of selection is limited due to genetic drift and mutation bias, such that phenotypic traits of species with smaller Ne, where drift is more potent, are less optimized than those of species with larger Ne [42]. For transcript diversity, the drift-barrier model predicts that errors in transcriptional and post-transcriptional processing (e.g., incorrect AS, imprecise polyadenylation, or aberrant transcription initiation) that generate non-functional or weakly deleterious transcript isoforms will persist at higher frequencies in species with smaller Ne. Our observation that transcript diversity declines with Ne (and its proxies) confirms this prediction and hence supports the drift-barrier model.

Comparative studies across species often encounter confounding factors that could bias the outcome. For instance, estimating transcript diversity in this study relies on transcript annotations, which vary in completeness across species, with model or well-studied organisms typically having more comprehensive annotations, potentially introducing an interspecific bias. To minimize this bias, we performed de novo transcript assembly for each species using RNA-seq data, ensuring uniform annotation processes and quality (see Materials and methods). Although our approach may reduce annotation quality for model organisms, it mitigates the potential interspecific bias. Indeed, the observed patterns are unaltered by including (Figs 2E, 3D, and 4D) or excluding (S4 Fig) model species. Similarly, genome size and complexity can influence genome annotations and transcript diversity assessments. To address this issue, we employed BUSCO genes. These genes are single-copy highly conserved orthologs that are unaffected by genome size or complexity across species, ensuring comparability among taxa. Indeed, when using BUSCO genes to compute transcript diversity, we found that our conclusions hold regardless of whether genome size and complexity are controlled or not (S5 Fig). Thus, our results are robust to the above potential confounding factors in multispecies comparisons.

As mentioned, Benitiere and colleagues (2024) also reported a negative correlation between Ne and transcript diversity caused by AS across 53 species [36]. Nevertheless, our methodology differs from that of Benitiere and colleagues in several aspects. First, we calculated transcript diversity at the gene level, whereas Benitiere and colleagues measured it at the intron level. Second, Benitiere and colleagues combined multiple RNA-seq datasets to detect splicing events, which improved splicing event detection but introduced heterogeneity among tissues. By contrast, we compared the same tissue across species, which made the interspecific comparison fairer. Third, Benitiere and colleagues analyzed 53 species, most being insects, while our analysis encompassed 75 animals with a broader phylogenetic sampling. Fourth, in addition to the three Ne proxies used by Benitiere and colleagues in the correlation analysis, our study also used Ne from 26 species. Notwithstanding these methodological differences, the findings of the two studies are consistent.

In the debate about the biological significance of AS, several authors noted a significant positive correlation between the amount of AS of a species and its organismal complexity measured by the number of cell types [67,68]. Chen and colleagues [68] reported that this correlation remains even after the control for the species’ Ne, suggesting that AS is adaptive and is at least partially responsible for organismal complexity. Their analysis was recently criticized by Benitiere and colleagues [36] for using nucleotide diversity at synonymous sites (πS) as a proxy for Ne. Synonymous mutations are often non-neutral [46], but even when they are neutral, πS is determined by both Ne and the mutation rate per site per generation, the latter of which varies across species [42]. Hence, πS is not an appropriate proxy for Ne. While Benitiere and colleagues suggested that ω would be a more appropriate proxy for Ne, they did not perform the actual partial correlation analysis. We therefore investigated the relationships among transcript diversity caused by AS, number of cell types, and ω across 12 species for which all three estimates are available in our brain tissue dataset (S4 Data). Consistent with the finding of Chen and colleagues [68], we observed a positive correlation between the transcript diversity caused by AS and the number of cell types across species even after the control for ω, but this partial correlation did not reach statistical significance in the PGLS analysis (S6 Fig). That is, after the control for the phylogenetic relationships in the data and ω (as a proxy for Ne), there is no significant partial correlation between organismal complexity measured by the number of cell types and transcript diversity caused by AS. We note that, even if the above partial correlation is significant, it does not mean that AS underlies organismal complexity. This is because, to demonstrate that AS contributes to organismal complexity, one needs to show that AS varies among cell types and plays a role in the functional diversity among cell types, which will be an interesting direction to pursue in the future when AS can be reliably assessed from single-cell RNA-seq data.

It is important to note that beyond APA, ATI, and AS, there are other variations in gene expression that result in gene product diversity, including post-transcriptional modifications (e.g., RNA editing [69] and m5C modifications [37]), translation variations (e.g., alternative translation initiation [70], mistranslation [71], and stop-codon read-through [72]), and post-translational modifications (e.g., phosphorylation). Notably, a recent study of the mis-transcription rate reveals a narrow range of variation across the tree of life [73]. Hence, it would be highly valuable to conduct cross-species comparisons as performed here for additional types of transcript diversity when appropriate data become available from a sufficient number of species.

Materials and methods

Genomes, transcriptomes, and Ne estimates

The species used in this study were sourced from Zhang and colleagues and Bénitière and colleagues [36,43]. The Ne estimates were acquired from the references listed in S2 Data, where Ne for the majority of species was inferred from measures of presumably neutral polymorphism (π) and germline mutation rate (µ), under the assumption of mutation-drift equilibrium, following the relationship Ne = π/(4µ) for diploid species. Data of body length, life span, and number of cell types were obtained from previous studies [36,47]. Reference genomes, protein sequences, cDNA sequences, and RNA-seq data were downloaded from the ENSEMBL (release 100) [74] and NCBI [75]. Accession numbers of RNA-seq samples are listed in S1 Data.
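The relationship Ne = π/(4µ) stated above can be illustrated with a short calculation. This is a minimal sketch; the function name and the input values are hypothetical and are not estimates for any species in S2 Data.

```python
def effective_population_size(pi, mu):
    """Ne for a diploid species at mutation-drift equilibrium: Ne = pi / (4 * mu).

    pi: neutral nucleotide diversity per site
    mu: germline mutation rate per site per generation
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi / (4 * mu)


# Hypothetical inputs (illustration only)
ne = effective_population_size(pi=0.001, mu=1.25e-8)
print(round(ne))
```

The formula makes clear why π alone is a poor stand-in for Ne: two species with the same π but different µ have different Ne.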

Phylogenetic tree

We obtained the phylogenetic tree of the species concerned from the Open Tree of Life [76]. We used BUSCO to identify single-copy protein-coding orthologous genes conserved in all 100 species, concatenated the protein sequences of the single-copy orthologs in each species, and performed multiple sequence alignments using MUSCLE [77] with the default parameters. Poorly aligned regions in the resulting alignment were removed using trimAl [78] with the “-automated1” method. The branch lengths of the tree were calculated using codeml from PAML with the JTT substitution model (“seqtype = 2, runmode = 0, model = 2, aaRateFile = jones.dat”).

ω estimation

For each focal species, we selected a triplet consisting of three species to estimate its ω value. Specifically, the triplet included the focal species, a closely related ingroup species, and an outgroup species. For instance, the ω values of human and chimpanzee were estimated using the triplet ((human, chimpanzee), gorilla), whereas the ω value for gorilla was estimated using ((human, gorilla), orangutan). For each triplet, we identified 1:1:1 orthologous protein-coding genes among the three species using OrthoFinder [79]. These orthologous genes were concatenated in the same order to construct a supergene. Multiple sequence alignments of the supergene protein sequences were performed using MUSCLE [77] with default parameters, and codon alignments were generated using TranAlign from the EMBOSS package [80]. Finally, we estimated the ω value of the supergene using the branch model (model = 1) implemented in codeml from the PAML package [81]. The resulting ω value of the supergene was taken as the ω value for that species. When focusing on BUSCO genes, we used only the BUSCO orthologs in estimating ω.

RNA-seq data processing

We utilized fastp [82] to perform quality control of RNA-seq raw reads and then aligned high-quality reads to the corresponding reference genomes using STAR 2.7.10a [83]. Read counts of a gene in each sample were obtained by featureCounts [84] based on the read alignment generated by STAR 2.7.10a. The total exon length of a gene, calculated using a custom script, is defined as the effective length of the gene. Finally, the expression level of a gene is calculated by Transcripts Per Million (TPM) [85]. To mitigate potential biases arising from differences in genome annotation across species, we used StringTie [63] to reassemble gene transcripts from RNA-seq data as reference RNA isoforms.
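The TPM calculation from read counts and effective gene lengths can be sketched as follows. This is a minimal illustration of the standard TPM definition, not the custom script used in the study; the function name and the toy counts and lengths are hypothetical.

```python
def tpm(counts, lengths_bp):
    """Transcripts Per Million from per-gene read counts and effective lengths.

    counts: gene ID -> read count
    lengths_bp: gene ID -> effective length (here, total exon length in bp)
    """
    # Length-normalized read rate for each gene
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    scale = sum(rates.values())
    if scale == 0:
        raise ValueError("no reads assigned to any gene")
    # Rescale so that values sum to one million within the sample
    return {gene: rate / scale * 1e6 for gene, rate in rates.items()}


# Hypothetical counts and total exon lengths (illustration only)
toy_counts = {"g1": 100, "g2": 200, "g3": 300}
toy_lengths = {"g1": 1000, "g2": 2000, "g3": 1000}
toy_tpm = tpm(toy_counts, toy_lengths)
print({gene: round(value) for gene, value in toy_tpm.items()})
```

Here g1 and g2 receive the same TPM even though g2 has twice the reads, because g2 is twice as long.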

APA data processing

3′-end-seq data were downloaded from NCBI (S1 Data). Raw reads were inspected using FastQC, and adapters were removed with Cutadapt. Clean reads were then aligned to the corresponding reference genomes using Bowtie2 [86] with default parameters. Uniquely aligned reads were processed to define APA sites. APA sites supported by fewer than two 3′-end-seq reads were excluded from further analysis, and the remaining sites located within 30 base pairs of one another were merged into a single cluster. The APA site with the highest read count within each cluster represented the cluster, and the total number of 3′-end-seq reads mapped to all APA sites within a cluster was considered the expression abundance of the APA site. An APA site was assigned to a gene if it was mapped within the region spanning from the 5′-end of the gene to 1,000 bp downstream of the 3′-end of the gene. APA sites mapped to multiple genes were excluded from further analysis. The expression level of an APA site was calculated by dividing the total number of reads mapped to the site by the total number of reads from the corresponding library that were successfully mapped to the genome. The TAPAS [55] pipeline with default parameters was used to predict APA sites from RNA-seq data. For APA sites identified by 3′-end-seq or predicted from RNA-seq, the APA site with the highest expression level in a gene was considered the major APA site of the gene, while all other APA sites were considered minor APA sites. The total percentage usage of minor APA sites in a gene was calculated by dividing the total expression level of minor APA sites by the total expression level of all APA sites of the gene.
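The filtering, clustering, and minor-site usage steps can be sketched as below. Several points are assumptions rather than details given in the text: sites are taken to come from one chromosome and one strand, "within 30 base pairs of one another" is read as single-linkage chaining along sorted positions, and the names `cluster_sites` and `minor_usage` are hypothetical.

```python
from typing import List, Tuple

Site = Tuple[int, int]  # (genomic position, supporting read count)


def cluster_sites(sites: List[Site], min_reads: int = 2,
                  max_gap: int = 30) -> List[Site]:
    """Return (representative position, summed reads) for each cluster."""
    # Drop sites supported by fewer than `min_reads` reads, then sort by position.
    kept = sorted(site for site in sites if site[1] >= min_reads)
    clusters: List[List[Site]] = []
    for pos, reads in kept:
        # Assumption: a site joins the current cluster if it lies within
        # `max_gap` bp of the previous retained site (single linkage).
        if clusters and pos - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append((pos, reads))
        else:
            clusters.append([(pos, reads)])
    # Representative = site with the most reads; abundance = all reads in cluster.
    return [(max(c, key=lambda s: s[1])[0], sum(r for _, r in c))
            for c in clusters]


def minor_usage(abundances: List[float]) -> float:
    """Fraction of a gene's total site expression that comes from minor sites."""
    total = sum(abundances)
    return (total - max(abundances)) / total if total else 0.0


# Toy sites for one gene (hypothetical numbers).
sites = [(100, 5), (110, 20), (135, 3), (400, 1), (500, 10)]
clustered = cluster_sites(sites)
usage = minor_usage([reads for _, reads in clustered])
```

Within a single library, dividing every site by the same total of mapped reads does not change this ratio, so raw cluster read counts are used directly in the toy example.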

ATI data processing

CAGE-seq data were downloaded from FANTOM5 [87] and NCBI (S1 Data). Raw CAGE-seq reads were inspected using FastQC, and adapters were removed using Cutadapt. rRNAdust was then used to filter out rRNA reads. Next, the cleaned CAGE-seq reads were aligned to the genome using HISAT2 [88] with the default settings. Uniquely aligned reads were processed with the CAGEr [89] package and converted to quantified CAGE transcriptional start site (CTSS) coordinates. CTSSs supported by fewer than two CAGE-seq reads were excluded from further analysis. Remaining CTSSs located within 30 base pairs of one another were merged into a single cluster. The CTSS with the highest read count within each cluster was designated as the representative position of the cluster, referred to as an ATI site. The total number of CAGE-seq reads mapped to all CTSSs within a cluster was defined as the expression abundance of the corresponding ATI site. An ATI site was assigned to a gene if it was mapped within the region spanning from 1,000 bp upstream of the 5′-end of the gene to the 3′-end of the gene. ATI sites mapped to multiple genes were excluded from further analysis. The expression level of an ATI site was calculated by dividing the total number of reads mapped to the site by the total number of reads from the corresponding library that were successfully mapped to the genome. The SEASTAR [60] pipeline with default parameters was used to predict ATI sites from RNA-seq data. ATI sites with a first-exon coverage of fewer than two reads were excluded. The expression level of a predicted ATI site was calculated by dividing its first-exon coverage by the product of the length of the exon and the total number of reads mapped to the genome in the library. For ATI sites identified by CAGE-seq or predicted from RNA-seq, the ATI site with the highest expression level for a gene was considered the major ATI site of the gene, while all other ATI sites were considered minor sites. The total percentage usage of minor ATI sites of a gene was computed by dividing the total expression level of minor ATI sites by the total expression level of all ATI sites of the gene.
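The gene-assignment rule for ATI sites can be sketched as follows. The `Gene` record, the function `assign_site`, and the toy coordinates are hypothetical; requiring the site and the gene to be on the same strand, and treating all genes as lying on one chromosome, are assumptions not stated in the text.

```python
from typing import List, NamedTuple, Optional


class Gene(NamedTuple):
    name: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"


def assign_site(pos: int, strand: str, genes: List[Gene],
                flank: int = 1000) -> Optional[str]:
    """Return the single gene whose window contains the site, else None."""
    hits = []
    for gene in genes:
        if gene.strand != strand:  # assumption: strands must match
            continue
        # The 5' end is `start` on the plus strand and `end` on the minus strand,
        # so the upstream flank extends left or right accordingly.
        if strand == "+":
            lo, hi = gene.start - flank, gene.end
        else:
            lo, hi = gene.start, gene.end + flank
        if lo <= pos <= hi:
            hits.append(gene.name)
    # Sites falling in the windows of multiple genes are excluded.
    return hits[0] if len(hits) == 1 else None


genes = [Gene("A", 5000, 8000, "+"),
         Gene("B", 9000, 12000, "-"),
         Gene("C", 7500, 9500, "+")]
upstream_of_a = assign_site(4500, "+", genes)
ambiguous = assign_site(7800, "+", genes)
upstream_of_b = assign_site(12500, "-", genes)
```

The same logic would apply to APA sites with the flank placed downstream of the 3′-end instead of upstream of the 5′-end.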

AS data processing

The splicing junctions and expression level of each RNA isoform in a gene were respectively assembled and measured using StringTie [63] after STAR alignment. The isoform with the highest TPM in a gene was regarded as the major RNA transcript of the gene, while other isoforms were considered minor RNA transcripts. For junctions that are shared between multiple transcripts, we reassigned junction reads to these transcripts based on their expression levels. The total percentage usage of junctions in minor RNA splicing isoforms was calculated by dividing the splicing amount in minor RNA splicing isoforms of a gene by the total splicing amount of the gene.
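One way to read the junction reassignment step is sketched below, under the assumption that reads at a shared junction are split among the isoforms containing it in proportion to their TPM. The function `minor_junction_usage`, the data layout, and the toy values are hypothetical and are not taken from the authors' code.

```python
from typing import Dict, Set, Tuple


def minor_junction_usage(isoforms: Dict[str, Tuple[float, Set[str]]],
                         junction_reads: Dict[str, float]) -> float:
    """Fraction of a gene's junction reads assigned to minor isoforms."""
    # The major isoform is the one with the highest TPM.
    major = max(isoforms, key=lambda name: isoforms[name][0])
    assigned = {name: 0.0 for name in isoforms}
    for junction, reads in junction_reads.items():
        carriers = [n for n, (_, juncs) in isoforms.items() if junction in juncs]
        weight = sum(isoforms[n][0] for n in carriers)
        if weight == 0:
            continue  # junction not carried by any expressed isoform
        # Assumption: shared junction reads are split proportionally to TPM.
        for n in carriers:
            assigned[n] += reads * isoforms[n][0] / weight
    total = sum(assigned.values())
    return (total - assigned[major]) / total if total else 0.0


# Toy gene with two isoforms sharing junction j1 (hypothetical values).
isoforms = {"iso1": (80.0, {"j1", "j2"}), "iso2": (20.0, {"j1", "j3"})}
junction_reads = {"j1": 100, "j2": 50, "j3": 10}
usage = minor_junction_usage(isoforms, junction_reads)
```

In this toy gene, the 100 reads at `j1` are split 80:20, so the minor isoform `iso2` receives 30 of the 160 junction reads.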

Data analysis

Data analysis was conducted using R statistical software (v4.2). To account for phylogenetic inertia in cross-species correlations, we performed PGLS regression [36] using the R package “caper”.

Supporting information

S1 Fig. Analysis of transcript diversity caused by APA (measured by total percentage usage of minor APA sites) using 3′-end-seq and RNA-seq.

(A) Correlation between Ne and transcript diversity caused by APA estimated from 3′-end-seq across seven metazoans. All protein-coding genes are used in the analysis. (B) Spearman’s correlation between the expression level of an APA site quantified by 3′-end-seq and that predicted by RNA-seq. (C) Spearman’s correlation between the total percentage usage of minor APA sites in a gene quantified by 3′-end-seq and that predicted by RNA-seq. The X-axes in (B) and (C) show the SRA accession numbers of RNA-seq and 3′-end-seq data from the same sample. (D) Spearman’s correlation between transcript diversity caused by APA in a species quantified by 3′-end-seq and that predicted by RNA-seq. (E) Spearman’s correlation between the gene expression level and the total percentage usage of minor APA sites across genes in each of three tissues in each of 75 species. Each row represents a species. (F) Correlation between transcript diversity caused by APA and Ne, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (F). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s001

(TIF)

S2 Fig. Analysis of transcript diversity caused by ATI (measured by the total percentage usage of minor ATI sites) using CAGE-seq and RNA-seq.

(A) Spearman’s correlation between the expression level of an ATI site quantified by CAGE-seq and that predicted by RNA-seq. (B) Spearman’s correlation between the total percentage usage of minor ATI sites in a gene quantified by CAGE-seq and that predicted by RNA-seq. The X-axes in (A) and (B) show the SRA accession numbers of RNA-seq and CAGE-seq data from the same sample. (C) Spearman’s correlation between transcript diversity caused by ATI in a species quantified by CAGE-seq and that predicted by RNA-seq. Each dot is a species. (D) Spearman’s correlation between the gene expression level and the total percentage usage of minor ATI sites across genes in each of three tissues in each of 75 species. Each row represents a species. (E) Correlation between transcript diversity caused by ATI and Ne, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (E). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s002

(TIF)

S3 Fig. Analysis of transcript diversity caused by AS (measured by the total percentage usage of splicing junctions in minor RNA splicing isoforms).

(A) Spearman’s correlation between the gene expression level and the total percentage usage of splicing junctions in minor RNA splicing isoforms across genes in each of three tissues in each of 75 species. Each row represents a species. (B) Correlation between transcript diversity caused by AS and Ne, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (B). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s003

(TIF)

S4 Fig. Correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after removing 9 model species (Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, Drosophila melanogaster, Aedes aegypti, and Caenorhabditis elegans).

The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s004

(TIF)

S5 Fig. Partial correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after the control for genome size and complexity.

Dots represent the raw data. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s005

(TIF)

S6 Fig. Partial correlation between transcript diversity caused by AS and the number of cell types after the control for ω.

Dots represent the raw data. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s006

(TIF)

S1 Data. Available datasets of transcriptomes and genomes for the species investigated.

https://doi.org/10.1371/journal.pbio.3003671.s007

(XLSX)

S2 Data. Ne of 26 metazoans reported in previous studies.

https://doi.org/10.1371/journal.pbio.3003671.s008

(DOCX)

S4 Data. Twelve species with available information about the number of cell types.

https://doi.org/10.1371/journal.pbio.3003671.s010

(XLSX)

Acknowledgments

We thank Michael Lynch for valuable comments.

References

  1. Alfonso-Gonzalez C, Hilgers V. (Alternative) transcription start sites as regulators of RNA processing. Trends Cell Biol. 2024;34(12):1018–28. pmid:38531762
  2. Elkon R, Ugalde AP, Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14(7):496–506. pmid:23774734
  3. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet. 2010;11(5):345–55. pmid:20376054
  4. de Klerk E, ’t Hoen PAC. Alternative mRNA transcription, processing, and translation: insights from RNA sequencing. Trends Genet. 2015;31(3):128–39. pmid:25648499
  5. Licatalosi DD, Darnell RB. RNA processing and its regulation: global insights into biological networks. Nat Rev Genet. 2010;11(1):75–87. pmid:20019688
  6. Dang TTV, Colin J, Janbon G. Alternative transcription start site usage and functional implications in pathogenic fungi. J Fungi (Basel). 2022;8(10):1044. pmid:36294609
  7. Fang S, Hou X, Qiu K, He R, Feng X, Liang X. The occurrence and function of alternative splicing in fungi. Fungal Biol Rev. 2020;34(4):178–88.
  8. Liu X, Hoque M, Larochelle M, Lemay J-F, Yurko N, Manley JL, et al. Comparative analysis of alternative polyadenylation in S. cerevisiae and S. pombe. Genome Res. 2017;27(10):1685–95. pmid:28916539
  9. Xing D, Li QQ. Alternative polyadenylation and gene expression regulation in plants. Wiley Interdiscip Rev RNA. 2011;2(3):445–58. pmid:21957029
  10. Syed NH, Kalyna M, Marquez Y, Barta A, Brown JWS. Alternative splicing in plants – coming of age. Trends Plant Sci. 2012;17(10):616–23. pmid:22743067
  11. Le NT, Harukawa Y, Miura S, Boer D, Kawabe A, Saze H. Epigenetic regulation of spurious transcription initiation in Arabidopsis. Nat Commun. 2020;11(1):3224. pmid:32591528
  12. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, et al. A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462–70. pmid:24670764
  13. Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012;22(6):1173–83. pmid:22454233
  14. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338(6114):1587–93. pmid:23258890
  15. Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. 2006;16(1):55–65. pmid:16344560
  16. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–5. pmid:18978789
  17. Shephard EA, Chandan P, Stevanovic-Walker M, Edwards M, Phillips IR. Alternative promoters and repetitive DNA elements define the species-dependent tissue-specific expression of the FMO1 genes of human and mouse. Biochem J. 2007;406(3):491–9. pmid:17547558
  18. Lianoglou S, Garg V, Yang JL, Leslie CS, Mayr C. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 2013;27(21):2380–96. pmid:24145798
  19. Baralle FE, Giudice J. Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol. 2017;18(7):437–51. pmid:28488700
  20. Davis W Jr, Schultz RM. Developmental change in TATA-box utilization during preimplantation mouse development. Dev Biol. 2000;218(2):275–83. pmid:10656769
  21. Ulitsky I, Shkumatava A, Jan CH, Subtelny AO, Koppstein D, Bell GW, et al. Extensive alternative polyadenylation during zebrafish development. Genome Res. 2012;22(10):2054–66. pmid:22722342
  22. Pozner A, Lotem J, Xiao C, Goldenberg D, Brenner O, Negreanu V, et al. Developmentally regulated promoter-switch transcriptionally controls Runx1 function during embryonic hematopoiesis. BMC Dev Biol. 2007;7:84. pmid:17626615
  23. Cheng LC, Zheng D, Baljinnyam E, Sun F, Ogami K, Yeung PL, et al. Widespread transcript shortening through alternative polyadenylation in secretory cell differentiation. Nat Commun. 2020;11(1):3182. pmid:32576858
  24. Fiszbein A, Kornblihtt AR. Alternative splicing switches: important players in cell differentiation. Bioessays. 2017;39(6):10.1002/bies.201600157. pmid:28452057
  25. Tazi J, Bakkour N, Stamm S. Alternative splicing and disease. Biochim Biophys Acta. 2009;1792(1):14–26. pmid:18992329
  26. Gruber AJ, Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nat Rev Genet. 2019;20(10):599–614. pmid:31267064
  27. Arce L, Yokoyama NN, Waterman ML. Diversity of LEF/TCF action in development and disease. Oncogene. 2006;25(57):7492–504. pmid:17143293
  28. Peterson ML. Mechanisms controlling production of membrane and secreted immunoglobulin during B cell development. Immunol Res. 2007;37(1):33–46. pmid:17496345
  29. Salz HK, Erickson JW. Sex determination in Drosophila: the view from the top. Fly (Austin). 2010;4(1):60–70. pmid:20160499
  30. Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH-M. The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 2008;24(4):167–77. pmid:18329129
  31. Mayr C. Evolution and biological roles of alternative 3’UTRs. Trends Cell Biol. 2016;26(3):227–37. pmid:26597575
  32. Xu C, Zhang J. Alternative polyadenylation of mammalian transcripts is generally deleterious, not adaptive. Cell Syst. 2018;6(6):734-742.e4. pmid:29886108
  33. Xu C, Park J-K, Zhang J. Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biol. 2019;17(3):e3000197. pmid:30883542
  34. Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, et al. The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome Biol. 2017;18(1):208. pmid:29084568
  35. Xu C, Zhang J. Mammalian circular RNAs result largely from splicing errors. Cell Rep. 2021;36(4):109439. pmid:34320353
  36. Bénitière F, Necsulea A, Duret L. Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Elife. 2024;13:RP93629. pmid:38470242
  37. Li Z, Mi K, Xu C. Most m5C Modifications in Mammalian mRNAs are Nonadaptive. Mol Biol Evol. 2025;42(1):msaf008. pmid:39824217
  38. Li Z, Sarker B, Zhao F, Zhou T, Zhang J, Xu C. COL: a method for identifying putatively functional circular RNAs. J Genet Genomics. 2024;51(11):1338–41. pmid:39218058
  39. Zhang J, Xu C. Gene product diversity: adaptive or not? Trends Genet. 2022;38(11):1112–22. pmid:35641344
  40. Melamud E, Moult J. Stochastic noise in splicing machinery. Nucleic Acids Res. 2009;37(14):4873–86. pmid:19546110
  41. Ohta T. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 1992;23(1):263–86.
  42. Lynch M, Ackerman MS, Gout J-F, Long H, Sung W, Thomas WK, et al. Genetic drift, selection and the evolution of the mutation rate. Nat Rev Genet. 2016;17(11):704–14. pmid:27739533
  43. Zhang H, Wang Y, Wu X, Tang X, Wu C, Lu J. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun. 2021;12(1):1076. pmid:33597535
  44. Figuet E, Nabholz B, Bonneau M, Mas Carrio E, Nadachowska-Brzyska K, Ellegren H, et al. Life history traits, protein evolution, and the nearly neutral theory in amniotes. Mol Biol Evol. 2016;33(6):1517–27. pmid:26944704
  45. Waples RS. Life-history traits and effective population size in species with overlapping generations revisited: the importance of adult mortality. Heredity (Edinb). 2016;117(4):241–50. pmid:27273324
  46. Zhang J, Qian W. Functional synonymous mutations and their evolutionary consequences. Nat Rev Genet. 2025;26(11):789–804. pmid:40394196
  47. Alvarez-Ponce D, Krishnamurthy S. Organismal complexity strongly correlates with the number of protein families and domains. Proc Natl Acad Sci U S A. 2025;122(5):e2404332122. pmid:39874285
  48. Chen W, Jia Q, Song Y, Fu H, Wei G, Ni T. Alternative polyadenylation: methods, findings, and impacts. Genomics Proteomics Bioinformatics. 2017;15(5):287–300. pmid:29031844
  49. Wu G, Schmid M, Jensen TH. 3’ End sequencing of pA+ and pA- RNAs. Methods Enzymol. 2021;655:139–64. pmid:34183119
  50. Hoque M, Ji Z, Zheng D, Luo W, Li W, You B, et al. Analysis of alternative cleavage and polyadenylation by 3’ region extraction and deep sequencing. Nat Methods. 2013;10(2):133–9. pmid:23241633
  51. Jan CH, Friedman RC, Ruby JG, Bartel DP. Formation, regulation and evolution of Caenorhabditis elegans 3’UTRs. Nature. 2011;469(7328):97–101. pmid:21085120
  52. Shepard PJ, Choi E-A, Lu J, Flanagan LA, Hertel KJ, Shi Y. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA. 2011;17(4):761–72. pmid:21343387
  53. Herrmann CJ, Schmidt R, Kanitz A, Artimo P, Gruber AJ, Zavolan M. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3’ end sequencing. Nucleic Acids Res. 2020;48(D1):D174–9. pmid:31617559
  54. Ye W, Lian Q, Ye C, Wu X. A survey on methods for predicting polyadenylation sites from DNA sequences, bulk RNA-seq, and single-cell RNA-seq. Genomics Proteomics Bioinformatics. 2023;21(1):67–83. pmid:36167284
  55. Arefeen A, Liu J, Xiao X, Jiang T. TAPAS: tool for alternative polyadenylation site analysis. Bioinformatics. 2018;34(15):2521–9. pmid:30052912
  56. Chen M, Ji G, Fu H, Lin Q, Ye C, Ye W, et al. A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data. Brief Bioinform. 2020;21(4):1261–76. pmid:31267126
  57. Tang P, Yang Y, Li G, Huang L, Wen M, Ruan W, et al. Alternative polyadenylation by sequential activation of distal and proximal PolyA sites. Nat Struct Mol Biol. 2022;29(1):21–31. pmid:35013598
  58. Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5’-end RNA-sequencing methods. Nat Methods. 2018;15(7):505–11. pmid:29867192
  59. Cass AA, Xiao X. mountainClimber Identifies Alternative Transcription Start and Polyadenylation Sites in RNA-Seq. Cell Syst. 2019;9(4):393-400.e6. pmid:31542416
  60. Qin Z, Stoilov P, Zhang X, Xing Y. SEASTAR: systematic evaluation of alternative transcription start sites in RNA. Nucleic Acids Res. 2018;46(8):e45. pmid:29546410
  61. Gacita AM, Dellefave-Castillo L, Page PGT, Barefield DY, Wasserstrom JA, Puckelwartz MJ, et al. Altered enhancer and promoter usage leads to differential gene expression in the normal and failed human heart. Circ Heart Fail. 2020;13(10):e006926. pmid:32993371
  62. Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 2010;6(12):e1001236. pmid:21151575
  63. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5. pmid:25690850
  64. Zhang J, Yang J-R. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 2015;16(7):409–20. pmid:26055156
  65. Xu C, Zhang J. A different perspective on alternative cleavage and polyadenylation. Nat Rev Genet. 2020;21(1):63. pmid:31745293
  66. Lynch M. Evolution of the mutation rate. Trends Genet. 2010;26(8):345–52. pmid:20594608
  67. Schad E, Tompa P, Hegyi H. The relationship between proteome size, structural disorder and organism complexity. Genome Biol. 2011;12(12):R120. pmid:22182830
  68. Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Mol Biol Evol. 2014;31(6):1402–13. pmid:24682283
  69. Xu G, Zhang J. Human coding RNA editing is generally nonadaptive. Proc Natl Acad Sci U S A. 2014;111(10):3769–74. pmid:24567376
  70. Xu C, Zhang J. Mammalian alternative translation initiation is mostly nonadaptive. Mol Biol Evol. 2020;37(7):2015–28. pmid:32145028
  71. Sun M, Zhang J. Preferred synonymous codons are translated more accurately: Proteomic evidence, among-species variation, and mechanistic basis. Sci Adv. 2022;8(27):eabl9812. pmid:35857447
  72. Li C, Zhang J. Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLoS Genet. 2019;15(5):e1008141. pmid:31120886
  73. Li W, Baehr S, Marasco M, Reyes L, Brister D, Pikaard CS, et al. A narrow range of transcript-error rates across the Tree of Life. Sci Adv. 2025;11(28):eadv9898. pmid:40644547
  74. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–95. pmid:34791404
  75. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021;49(D1):D10–7. pmid:33095870
  76. Hinchliff CE, Smith SA, Allman JF, Burleigh JG, Chaudhary R, Coghill LM, et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci U S A. 2015;112(41):12764–9. pmid:26385966
  77. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7. pmid:15034147
  78. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. pmid:19505945
  79. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16(1):157. pmid:26243257
  80. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276–7. pmid:10827456
  81. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. pmid:17483113
  82. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–90. pmid:30423086
  83. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886
  84. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30. pmid:24227677
  85. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5. pmid:22872506
  86. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. pmid:22388286
  87. Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015;16(1):22. pmid:25723102
  88. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60. pmid:25751142
  89. Haberle V, Forrest ARR, Hayashizaki Y, Carninci P, Lenhard B. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015;43(8):e51. pmid:25653163