Transcript diversity reflects deleterious RNA processing errors shaped by population size in metazoans

Kai Mi; Lili Guan; Bandhan Sarker; Siliang Song; Tianjiao Zhou; Hongliang Yi; Jianzhi Zhang; Chuan Xu

doi:10.1371/journal.pbio.3003671

Abstract

In eukaryotes, alternative transcription initiation (ATI), alternative splicing (AS), and alternative polyadenylation (APA) result in multiple different transcripts per gene, but the biological significance of the transcript diversity produced remains controversial. Some have suggested that this diversity is adaptive, while others have contended that it is largely deleterious and arises from molecular errors in transcription and RNA processing. The error hypothesis makes a distinct prediction that is not expected under the adaptive hypothesis: transcript diversity declines with the effective population size (N_e) of the species because natural selection minimizing errors is more effective under larger N_e. By analyzing 166 transcriptomes from 75 metazoans, we report that transcript diversity, measured by the percentage usage of minor ATI, AS, and APA sites, decreases with N_e or its proxies. This observation supports the error hypothesis and suggests that metazoan transcript diversity is largely deleterious.

Citation: Mi K, Guan L, Sarker B, Song S, Zhou T, Yi H, et al. (2026) Transcript diversity reflects deleterious RNA processing errors shaped by population size in metazoans. PLoS Biol 24(3): e3003671. https://doi.org/10.1371/journal.pbio.3003671

Academic Editor: Laurence D. Hurst, University of Bath, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND

Received: November 25, 2025; Accepted: February 11, 2026; Published: March 19, 2026

Copyright: © 2026 Mi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All processed data and custom scripts in this study are available in GitHub (https://github.com/XuLabSJTU/Ne) and Zenodo (https://doi.org/10.5281/zenodo.18514977).

Funding: This study was supported by National Science and Technology Innovation 2030 Major Projects for ‘Brain Science and Brain-Inspired Research’ (2022ZD0214400 to CX), National Natural Science Foundation of China (32270704 and 32472630 to CX), Medical-Engineering Crossover Fund of Shanghai Jiao Tong University (YG2025QNB51 to CX), and the U.S. National Institutes of Health (R35GM139484 to JZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Abbreviations: APA, alternative polyadenylation; BUSCO, Benchmarking Universal Single-Copy Orthologs; AS, alternative splicing; ATI, alternative transcription initiation; CAGE-seq, Cap Analysis of Gene Expression Sequencing; CTSS, CAGE transcriptional start site; PGLS, Phylogenetic Generalized Least Squares; TPM, Transcripts Per Million

Introduction

In eukaryotes, alternative transcription initiation (ATI) [1] and alternative polyadenylation (APA) [2] can respectively vary the beginning and end of a transcript produced from a gene, while alternative splicing (AS) can generate different RNA isoforms by selective inclusion or exclusion of exons in mRNA processing [3]. As a result, multiple different RNA transcripts (isoforms) are often produced from a single eukaryotic gene [4,5], generating transcript diversity. These transcripts may vary in their coding sequence, untranslated regions, and/or other regulatory elements [1–3]. ATI, APA, and AS are common phenomena in various eukaryotes such as fungi [6–8], plants [9–11], and animals [12–14]. For example, in humans, >70% of genes exhibit APA [13], >50% of genes display ATI [15], and >95% of multi-exon genes show AS [16], resulting in >170,000 transcripts recorded for ~20,000 human protein-coding genes (ENSEMBL genome reference consortium human build 38; GRCh38). ATI, APA, and AS may vary among tissues [17–19], across developmental stages [19–21], and during cell differentiation [22–24], and can contribute to disease [1,25,26].

Despite the universality of ATI, APA, and AS in eukaryotes, the biological significance of the created transcript diversity is debated. Some case studies suggested that different RNA isoforms are functionally distinct. For instance, the human Lef1 gene [27], mouse Ighm gene [28], and Drosophila Sxl, Tra, and Dsx genes [29] have functionally distinct RNA isoforms (and corresponding protein isoforms) produced by ATI, APA, and AS, respectively. Such examples led to the hypothesis that transcript/proteome diversity is generally adaptive and that ATI, APA, and AS are widely used, regulated mechanisms to expand transcript/proteome diversity [3,4,30,31]. However, a competing hypothesis known as the error hypothesis has also been suggested [32–38]. The error hypothesis contends that transcription and RNA processing are error-prone; consequently, the vast majority of the observed transcript diversity reflects molecular errors that not only lower the number of functional molecules and waste energy but may also create cytotoxicity [39]. Several lines of evidence support the error hypothesis [32–36,39,40]. For example, the error hypothesis predicts that transcript diversity is lower in relatively highly expressed genes than in relatively lowly expressed genes because of stronger selection minimizing error rates acting on highly than lowly expressed genes [39]. Empirical data indeed support this prediction [39].

Nonetheless, the level of transcript diversity has not been extensively compared across species. Such a comparison is useful for differentiating between the adaptive and error hypotheses, because the two hypotheses make distinct predictions about the relationship between the level of transcript diversity in a species and the effective population size (N_e) of the species. Specifically, under the error hypothesis, transcript diversity is due to deleterious molecular error, so is disfavored and lowered by natural selection. Because the efficacy of natural selection increases with N_e, we expect transcript diversity to decline with N_e [41,42]. Under the adaptive hypothesis, however, transcript diversity is beneficial so may be selectively elevated. Under this scenario, transcript diversity is expected to increase with N_e. However, one could also argue that, under the adaptive hypothesis, the optimal level of transcript diversity in a species depends on the specific condition and environment of the species; as a result, no prediction can be made regarding the relationship between the level of transcript diversity and N_e. At any rate, a negative correlation between N_e and transcript diversity is predicted by the error hypothesis but is not expected under the adaptive hypothesis. Indeed, this prediction was validated by a comparative analysis of AS across 53 species [36].

In the present work, by quantifying ATI, APA, and AS in 166 transcriptomes, we compared levels of transcript diversity among 75 metazoan species spanning a wide range of N_e. We found that transcript diversity generally decreases with N_e or its proxies, strengthening the previous support of the error hypothesis for AS [36] and providing new evidence for this hypothesis for ATI and APA.

Results

Genomic and transcriptomic data

Based on two recent studies [36,43], we assembled a list of 100 diverse metazoan species with available genome sequences (Fig 1A). We collected publicly available genome, genomic annotation, and transcriptome data from these 100 species (Fig 1B and S1 Data). The transcriptomic data are the basis of transcript diversity estimation and they comprise Cap Analysis of Gene Expression Sequencing (CAGE-seq) data from seven species for direct measurement of ATI, 3′-end-seq data from 10 species for direct measurement of APA, and RNA-seq data from 75 species for detection of AS and prediction of ATI and APA (Fig 1B and S1 Data).

Fig 1. Species phylogeny, genomic and transcriptomic data, and N_e proxies used in the present study.

(A) Phylogenetic tree of the 100 metazoan species considered. (B) Available genome and transcriptome datasets of each species concerned, including CAGE-seq, 3′-end-seq, RNA-seq, coding genes, and BUSCO genes. (C) N_e and proxies. Spearman’s correlation and Phylogenetic Generalized Least Squares (PGLS) regression between N_e and life span (D), body length (E), and the nonsynonymous to synonymous substitution rate ratio ω computed using all BUSCO genes (F) and all one-to-one orthologous genes (G) across species. Each dot represents a species, colored according to its clade in (A). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g001

Because transcript diversity estimated from different tissues may not be comparable across species, we focused on three tissues (brain, ovary, and testis) best represented in the transcriptomic data, respectively covering 67, 51, and 48 species (Fig 1B and S1 Data). In interspecific comparisons, we initially analyzed all protein-coding genes in each species (Fig 1 and S1 Data). However, gene set variations among species could introduce a confounder in our comparison. We therefore focused on Benchmarking Universal Single-Copy Orthologs (BUSCO) genes in an additional comparison (Fig 1B and S1 Data), as was done in previous studies [36,43].

N_e and proxies

Of the 100 species, 26 have published N_e (Fig 1C and S2 Data). Given that most species lack published N_e, we resorted to three other parameters as N_e proxies: the body length, life span, and nonsynonymous to synonymous substitution rate ratio (ω). Because larger animals and longer-lived animals tend to have smaller N_e, body length and life span have been used as N_e proxies [36,44,45]. Under the nearly neutral theory and the neutral assumption of synonymous mutations, ω is expected to decline with N_e [41] so can also be a proxy for N_e. However, when synonymous mutations are frequently non-neutral as has been documented in some species [46], the validity of the above expectation is uncertain; we therefore examined it empirically (see below). We collected body length and life span data from previous studies [36,47] and estimated ω for each species (see Materials and methods and S3 Data). We correlated the three proxies with N_e and confirmed their negative correlations (Fig 1D–1G), as reported previously [36]. Therefore, the error hypothesis predicts that transcript diversity should decline with N_e or increase with the three N_e proxies considered here. Note that throughout this study, we used two types of correlation analysis. The first is the simple rank correlation, whereas the second is Phylogenetic Generalized Least Squares (PGLS) regression, which controls for the phylogenetic relationships of the species considered in the data (see Materials and methods).
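
The first of the two correlation analyses, the simple rank correlation, can be illustrated with a minimal standard-library sketch. The function names and the toy values below are hypothetical and are not taken from S2 Data; PGLS, which additionally requires the species phylogeny, is not covered here.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation computed on ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values for five species: N_e and body length (cm).
ne = [1e4, 5e4, 2e5, 1e6, 3e6]
body_length = [180.0, 60.0, 75.0, 2.0, 0.3]
rho = spearman_rho(ne, body_length)  # negative: larger N_e, smaller body
```

A negative rho in this toy example mirrors the direction of the relationship reported between N_e and its proxies.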

Interspecific variation in transcript diversity caused by APA

Accurate detection of APA typically relies on 3′-end RNA sequencing (i.e., 3′-end-seq), which can identify precise APA sites and their relative usages [48]. Several library preparation methods are available for 3′-end-seq [49], such as 3′READS [50], 3P-seq [51], and PAS-seq [52]. However, only a limited number of species have 3′-end-seq data [53]. We collected 3′-end-seq datasets from 10 species with N_e (S1 Data). For a given gene, let us refer to the most frequently used APA site as its major APA site, which is likely to be functionally the best APA site, and all other APA sites as its minor APA sites. We measured the transcript diversity of the gene caused by APA by the total percentage usage of its minor APA sites, which equals the total 3′-end reads of its minor APA sites divided by the total APA reads of the gene. We then averaged the transcript diversity across all genes considered in a species to represent the overall transcript diversity due to APA for the species. We found a negative correlation between transcript diversity and N_e across species, regardless of whether we considered all protein-coding genes (S1A Fig) or only BUSCO genes (Fig 2A). However, the negative correlations (or those from PGLS) were not significant, potentially due to the limited number of species in the analyses.
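
The per-gene and per-species measures defined above can be sketched as follows. This is an illustrative sketch rather than the study's script: the function names and read counts are hypothetical, and how genes with a single APA site or no reads are handled is an assumption (here, single-site genes contribute zero and genes without reads are skipped).

```python
def minor_site_usage(site_reads):
    """Fraction of a gene's APA reads that fall on its minor sites.

    site_reads: read counts, one per APA site of the gene.
    The site with the most reads is treated as the major site.
    """
    total = sum(site_reads)
    if total == 0:
        return None  # assumption: genes without reads are skipped
    return (total - max(site_reads)) / total

def species_transcript_diversity(genes):
    """Mean minor-site usage across genes that have at least one read."""
    usages = [minor_site_usage(reads) for reads in genes.values()]
    usages = [u for u in usages if u is not None]
    return sum(usages) / len(usages)

# Hypothetical read counts per APA site for three genes.
genes = {"geneA": [90, 10], "geneB": [50, 30, 20], "geneC": [40]}
diversity = species_transcript_diversity(genes)  # mean of 0.1, 0.5, 0.0
```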

Fig 2. Transcript diversity caused by APA (measured by the mean total percentage usage of minor APA sites per gene) declines with N_e across species.

(A) Relationship between N_e and transcript diversity caused by APA estimated from 3′-end-seq. (B–E) Relationship between transcript diversity caused by APA estimated from RNA-seq and N_e (B), life span (C), body length (D), or ω (E) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (F) Correlation between transcript diversity caused by APA and N_e (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g002

To increase the number of species in the APA analysis, we predicted APA sites using RNA-seq data [48,54]. In particular, TAPAS leverages the Pruned Exact Linear Time algorithm, RNA-seq data, and gene structure information to predict APA sites and their abundances [55] and is known to outperform other tools [56]. We validated the performance of TAPAS by comparing the APA sites identified by 3′-end-seq with those predicted by TAPAS from RNA-seq using a dataset in which CD4 T cells were sequenced by both 3′-end-seq and RNA-seq [57]. We quantified APA site usage levels in both data types (see Materials and methods) and found a significant positive correlation between them (ρ > 0.27, P < 0.01; S1B Fig). We also computed the total percentage usage of minor APA sites for each gene in both data types and again observed a significant positive correlation between them (ρ > 0.16, P < 0.01; S1C Fig). To validate the applicability of APA site prediction by TAPAS in multiple species, we acquired RNA-seq datasets from the 10 species with 3′-end-seq data; while 3′-end-seq and RNA-seq were not generated from the same samples, they were from the same tissues in each of these species (S1 Data). We found that the average total percentage usage of minor APA sites per gene is significantly correlated between 3′-end-seq and RNA-seq data across species (ρ = 0.79, P = 9.8 × 10⁻³; S1D Fig). These results confirm the reliability of predicting APA sites from RNA-seq data.

We previously reported a negative correlation between the total percentage usage of minor APA sites of a gene and the gene expression level in each of five mammals studied, supporting the error hypothesis [32]. To assess whether this pattern extends beyond mammals, we repeated the analysis using APA sites predicted from RNA-seq data and found that the negative correlation persists in 163 of the 166 samples analyzed (S1E Fig), suggesting that this pattern is generally true across animals.

Next, we used RNA-seq data to estimate the average total percentage usage of minor APA sites per BUSCO gene in each of 75 species. We started with the brain because the number of species with RNA-seq data from this tissue is the highest. Across species, the above estimate of species transcript diversity decreases with N_e (ρ = −0.73, P = 4.3 × 10⁻⁴; P_PGLS = 1.8 × 10⁻²; Fig 2B), but increases with life span (ρ = 0.61, P = 9 × 10⁻⁵; P_PGLS = 4.2 × 10⁻²; Fig 2C), body length (ρ = 0.70, P = 2.4 × 10⁻⁶; P_PGLS = 4.4 × 10⁻⁷; Fig 2D), and ω (ρ = 0.63, P = 8.6 × 10⁻⁹; P_PGLS = 3.5 × 10⁻³; Fig 2E). Similar results were observed for the ovary and testis (Fig 2F). When the above transcript diversity in a species was calculated using all protein-coding genes, the patterns remained qualitatively unchanged (S1F Fig). Hence, patterns of interspecific variation in transcript diversity caused by APA support the error hypothesis.

Interspecific variation in transcript diversity caused by ATI

Although ATI is ideally assessed by CAGE-seq data [58], such data are available for only several species in our collection (S1 Data). Because ATI can be inferred from RNA-seq data [59,60], we chose to use RNA-seq to compare ATI across species to allow the inclusion of a broader range of species in our analysis. We used SEASTAR, a method known to outperform other tools [60], to predict ATI sites. To validate the SEASTAR prediction, we collected a set of samples sequenced by both CAGE-seq and RNA-seq [61] and, respectively, identified ATI sites from CAGE-seq and from RNA-seq using SEASTAR for all protein-coding genes. Analogous to the APA analysis, we quantified the expression level for each ATI site, the total percentage usage of minor ATI sites for each gene, and the average total percentage usage of minor ATI sites per gene in a species (see Materials and methods), and found them to respectively exhibit a significant, positive correlation between estimates from CAGE-seq and those from RNA-seq (ρ > 0.32, P ≤ 0.02; S2A–S2C Fig). These findings support the reliability of predicting ATI site usage from RNA-seq data.

Our previous study in humans and mice showed that the total percentage usage of minor ATI sites of a gene declines with the gene expression level, supporting the error hypothesis of ATI [33]. In the three tissues of every species investigated in the present study, the above trend is observed (S2D Fig).

Next, we calculated the average total percentage usage of minor ATI sites per gene for the BUSCO genes of a species using RNA-seq data and correlated it with N_e across species. Indeed, a significant, negative correlation was observed across 19 species (ρ = −0.74, P = 2.7 × 10⁻⁴; P_PGLS = 6.8 × 10⁻³; Fig 3A). We similarly observed a positive correlation when N_e is replaced with life span (ρ = 0.66, P = 1.3 × 10⁻⁵; P_PGLS = 3.1 × 10⁻²; Fig 3B), body length (ρ = 0.78, P = 2.5 × 10⁻⁸; P_PGLS = 1.8 × 10⁻⁹; Fig 3C), or ω (ρ = 0.49, P = 2.9 × 10⁻⁵; P_PGLS = 1.7 × 10⁻⁷; Fig 3D). Similar results were obtained for the testis and ovary for BUSCO genes (Fig 3E) or all protein-coding genes (S2E Fig).

Fig 3. Transcript diversity caused by ATI (measured by the mean total percentage usage of minor ATI sites per gene from RNA-seq) declines with N_e across species.

(A–D) Relationship between transcript diversity caused by ATI and N_e (A), life span (B), body length (C), or ω (D) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (E) Correlation between transcript diversity caused by ATI and N_e (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g003

Interspecific variation in transcript diversity caused by AS

Previous studies have shown that AS is noisy [62] and mostly nonadaptive [34], consistent with the error hypothesis. To compare transcript diversity caused by AS across species, we assembled the transcripts for a gene and quantified the expression level of each transcript of the gene using StringTie, a widely used tool that outperforms others in the accuracy of both assembly and expression level measurement [63]. We then computed for each gene its transcript diversity due to AS by dividing the total splicing amount of all minor RNA splicing isoforms by the total splicing amount of all RNA splicing isoforms of the gene. Here, the splicing amount of an RNA splicing isoform is the total number of reads covering all splicing junctions of the isoform.
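
The gene-level AS measure can be sketched from per-junction read counts as below. The function names, isoform IDs, and counts are hypothetical, and treating the isoform with the largest splicing amount as the major isoform is an assumption made for this sketch.

```python
def splicing_amount(junction_reads):
    """Total number of reads covering all splice junctions of one isoform."""
    return sum(junction_reads)

def as_transcript_diversity(isoforms):
    """Share of a gene's total splicing amount contributed by minor isoforms.

    isoforms: dict mapping isoform ID -> list of per-junction read counts.
    Assumption: the major isoform is the one with the largest splicing amount.
    """
    amounts = [splicing_amount(reads) for reads in isoforms.values()]
    total = sum(amounts)
    if total == 0:
        return None  # no junction-spanning reads; the gene is uninformative
    return (total - max(amounts)) / total

# Hypothetical gene with three assembled isoforms.
gene = {"iso1": [40, 35, 45], "iso2": [10, 12], "iso3": [8]}
as_diversity = as_transcript_diversity(gene)  # (22 + 8) / 150
```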

The error hypothesis predicts that the transcript diversity of a gene caused by AS should decrease with the expression level of the gene, because more highly expressed genes are subject to stronger selection against splicing error [64]. Indeed, we observed negative correlations in 158 of 166 samples (S3A Fig). These results confirm the previous findings from a limited number of species [34] and suggest that the error hypothesis of AS is broadly supported in animals.

Next, we calculated the mean transcript diversity (caused by AS) per gene for each species among BUSCO genes using data from the brain tissue. We found that this quantity significantly decreases with N_e (Fig 4A), but increases with life span (Fig 4B), body length (Fig 4C), and ω (Fig 4D). Similar results were observed for the ovary and testis (Fig 4E). These patterns remained qualitatively unchanged (S3B Fig) when all protein-coding genes were analyzed. Thus, consistent with a previous study [36], our across-species comparison of AS supports the error hypothesis.

Fig 4. Transcript diversity caused by AS (measured by the mean total percentage usage of splicing junctions in minor splicing isoforms per gene) declines with N_e across species.

(A–D) Relationship between transcript diversity caused by AS and N_e (A), life span (B), body length (C), or ω (D) in the brain. P-values from Spearman’s correlation and PGLS are shown. N represents the number of species included. (E) Correlation between transcript diversity caused by AS and N_e (or proxies) in three tissues. BUSCO genes are used in all panels. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.g004

Discussion

Transcript diversity primarily arises from ATI, APA, and AS. Although past studies have provided substantial genomic evidence for the error hypothesis of ATI [33], APA [32,65], and AS [34], these studies focused on a small number of species. As a result of this limitation and an increasing number of reports of cases of functional ATI, APA, or AS, the general biological significance of transcript diversity remains controversial. In the present study, we expanded the analysis to 75 species and showed that the previous findings from a small number of species generally hold across animals. More importantly, we found that the transcript diversity of a species declines with the species’ N_e or its proxies, as predicted by the error hypothesis.

A central theoretical underpinning of the non-adaptive paradigm relevant to our results is the drift-barrier model, which was first proposed by Lynch [66] to explain the mutation rate variation across species. This model predicts that the efficacy of selection is limited due to genetic drift and mutation bias, such that phenotypic traits of species with smaller N_e, where drift is more potent, are less optimized than those of species with larger N_e [42]. For transcript diversity, the drift-barrier model predicts that errors in transcriptional and post-transcriptional processing (e.g., incorrect AS, imprecise polyadenylation, or aberrant transcription initiation) that generate non-functional or weakly deleterious transcript isoforms will persist at higher frequencies in species with smaller N_e. Our observation that transcript diversity declines with N_e (and its proxies) confirms this prediction and hence supports the drift-barrier model.

Comparative studies across species often encounter confounding factors that could bias the outcome. For instance, estimating transcript diversity in this study relies on transcript annotations, which vary in completeness across species, with model or well-studied organisms typically having more comprehensive annotations, potentially introducing an interspecific bias. To minimize this bias, we performed de novo transcript assembly for each species using RNA-seq data, ensuring uniform annotation processes and quality (see Materials and methods). Although our approach may reduce annotation quality for model organisms, it mitigates the potential interspecific bias. Indeed, the observed patterns are unaltered by including (Figs 2E, 3D, and 4D) or excluding (S4 Fig) model species. Similarly, genome size and complexity can influence genome annotations and transcript diversity assessments. To address this issue, we employed BUSCO genes. These genes are single-copy highly conserved orthologs that are unaffected by genome size or complexity across species, ensuring comparability among taxa. Indeed, when using BUSCO genes to compute transcript diversity, we found that our conclusions hold regardless of whether genome size and complexity are controlled or not (S5 Fig). Thus, our results are robust to the above potential confounding factors in multispecies comparisons.

As mentioned, Bénitière and colleagues (2024) also reported a negative correlation between N_e and transcript diversity caused by AS across 53 species [36]. Nevertheless, our methodology differs from theirs in several respects. First, we calculated transcript diversity at the gene level, whereas Bénitière and colleagues measured it at the intron level. Second, Bénitière and colleagues combined multiple RNA-seq datasets to detect splicing events, which improved splicing event detection but introduced heterogeneity among tissues. By contrast, we compared the same tissue across species, which made the interspecific comparison fairer. Third, Bénitière and colleagues analyzed 53 species, most being insects, while our analysis encompassed 75 animals with a broader phylogenetic sampling. Fourth, in addition to the three N_e proxies used by Bénitière and colleagues in the correlation analysis, our study also used N_e from 26 species. Notwithstanding these methodological differences, the findings of the two studies are consistent.

In the debate about the biological significance of AS, several authors noted a significant positive correlation between the amount of AS of a species and its organismal complexity measured by the number of cell types [67,68]. Chen and colleagues [68] reported that this correlation remains even after controlling for the species’ N_e, suggesting that AS is adaptive and is at least partially responsible for organismal complexity. Their analysis was recently criticized by Bénitière and colleagues [36] for using nucleotide diversity at synonymous sites (π_S) as a proxy for N_e. Synonymous mutations are often non-neutral [46], but even when they are neutral, π_S is determined by both N_e and the mutation rate per site per generation, the latter of which varies across species [42]. Hence, π_S is not an appropriate proxy for N_e. While Bénitière and colleagues suggested that ω would be a more appropriate proxy for N_e, they did not perform the actual partial correlation analysis. We therefore investigated the relationships among transcript diversity caused by AS, number of cell types, and ω across 12 species for which all three estimates are available in our brain tissue dataset (S4 Data). Consistent with the finding of Chen and colleagues [68], we observed a positive correlation between the transcript diversity caused by AS and the number of cell types across species even after controlling for ω, but this partial correlation did not reach statistical significance in the PGLS analysis (S6 Fig). That is, after controlling for the phylogenetic relationships in the data and ω (as a proxy for N_e), there is no significant partial correlation between organismal complexity measured by the number of cell types and transcript diversity caused by AS. We note that, even if the above partial correlation were significant, it would not mean that AS underlies organismal complexity. This is because, to demonstrate that AS contributes to organismal complexity, one needs to show that AS varies among cell types and plays a role in the functional diversity among cell types, which will be an interesting direction to pursue in the future when AS can be reliably assessed from single-cell RNA-seq data.

It is important to note that beyond APA, ATI, and AS, there are other variations in gene expression that result in gene product diversity, including post-transcriptional modifications (e.g., RNA editing [69] and m⁵C modifications [37]), translation variations (e.g., alternative translation initiation [70], mistranslation [71], and stop-codon read-through [72]), and post-translational modifications (e.g., phosphorylation). Notably, a recent study of the mis-transcription rate reveals a narrow range of variation across the tree of life [73]. Hence, it would be highly valuable to conduct cross-species comparisons as performed here for additional types of transcript diversity when appropriate data become available from a sufficient number of species.

Materials and methods

Genomes, transcriptomes, and N_e estimates

The species used in this study were sourced from Zhang and colleagues and Bénitière and colleagues [36,43]. The N_e estimates were acquired from the references listed in S2 Data, where N_e for the majority of species was inferred from measures of presumably neutral polymorphism (π) and germline mutation rate (µ), under the assumption of mutation-drift equilibrium, following the relationship N_e = π/(4µ) for diploid species. Data on body length, life span, and number of cell types were obtained from previous studies [36,47]. Reference genomes, protein sequences, cDNA sequences, and RNA-seq data were downloaded from ENSEMBL (release 100) [74] and NCBI [75]. Accession numbers of RNA-seq samples are listed in S1 Data.
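
As a worked illustration of the relationship N_e = π/(4µ), the sketch below computes N_e from hypothetical input values; the function name and the numbers are illustrative assumptions and do not correspond to any species in S2 Data.

```python
def effective_population_size(pi, mu):
    """N_e = pi / (4 * mu) for a diploid species at mutation-drift equilibrium.

    pi: neutral nucleotide diversity per site.
    mu: germline mutation rate per site per generation.
    """
    if mu <= 0:
        raise ValueError("mutation rate must be positive")
    return pi / (4.0 * mu)

# Hypothetical inputs: pi = 0.001 and mu = 1.25e-8 give N_e of about 20,000.
ne_estimate = effective_population_size(pi=0.001, mu=1.25e-8)
```

Because π depends on both N_e and µ, the same π implies a different N_e whenever µ differs between species.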

Phylogenetic tree

We obtained the phylogenetic tree of the species concerned from the Open Tree of Life [76]. We used BUSCO to identify single-copy protein-coding orthologous genes conserved in all 100 species, concatenated the protein sequences of the single-copy orthologs in each species, and performed multiple sequence alignments using MUSCLE [77] with the default parameters. Poorly aligned regions in the resulting alignment were removed using trimAl [78] with the “-automated1” method. The branch lengths of the tree were calculated using codeml from PAML with the JTT substitution model (“seqtype = 2, runmode = 0, model = 2, aaRateFile = jones.dat”).

ω estimation

For each focal species, we selected a triplet consisting of three species to estimate its ω value. Specifically, the triplet included the focal species, a closely related ingroup species, and an outgroup species. For instance, the ω values of human and chimpanzee were estimated using the triplet ((human, chimpanzee), gorilla), whereas the ω value for gorilla was estimated using ((human, gorilla), orangutan). For each triplet, we identified 1:1:1 orthologous protein-coding genes among the three species using OrthoFinder [79]. These orthologous genes were concatenated in the same order to construct a supergene. Multiple sequence alignments of the supergene protein sequences were performed using MUSCLE [77] with default parameters, and codon alignments were generated using TranAlign from the EMBOSS package [80]. Finally, we estimated the ω value of the supergene using the branch model (model = 1) implemented in codeml from the PAML package [81]. The resulting ω value of the supergene was taken as the ω value for that species. When focusing on BUSCO genes, we used only the BUSCO orthologs in estimating ω.

RNA-seq data processing

We utilized fastp [82] to perform quality control of RNA-seq raw reads and then aligned high-quality reads to the corresponding reference genomes using STAR 2.7.10a [83]. Read counts of a gene in each sample were obtained by featureCounts [84] based on the read alignment generated by STAR 2.7.10a. The total exon length of a gene, calculated using a custom script, is defined as the effective length of the gene. Finally, the expression level of a gene is calculated by Transcripts Per Million (TPM) [85]. To mitigate potential biases arising from differences in genome annotation across species, we used StringTie [63] to reassemble gene transcripts from RNA-seq data as reference RNA isoforms.
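
The TPM calculation from read counts and effective gene lengths can be sketched as follows. This follows the standard TPM definition; the function name and the toy counts are hypothetical, and the effective length is taken to be the total exon length as stated above.

```python
def tpm(read_counts, effective_lengths):
    """Transcripts Per Million for each gene.

    read_counts: dict gene -> read count (e.g., from featureCounts).
    effective_lengths: dict gene -> total exon length in bp.
    """
    # Length-normalize each count, then rescale so the values sum to 1e6.
    rates = {g: read_counts[g] / effective_lengths[g] for g in read_counts}
    scale = sum(rates.values())
    return {g: rate / scale * 1e6 for g, rate in rates.items()}

# Hypothetical counts and lengths for three genes.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}
tpm_values = tpm(counts, lengths)
```

In this example geneA and geneB receive the same TPM despite a threefold difference in read counts, because geneB is three times as long.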

APA data processing

3′-end-seq data were downloaded from NCBI (S1 Data). Raw reads were inspected using FastQC and adapters were removed using Cutadapt. Next, clean reads were aligned to the corresponding reference genomes using Bowtie2 [86] with default parameters. Uniquely aligned reads were processed to define APA sites. APA sites supported by fewer than two 3′-end-seq reads were excluded from further analysis, and the remaining sites located within 30 base pairs of one another were merged into a single cluster. The APA site with the highest read count within each cluster represented the cluster, and the total number of 3′-end-seq reads mapped to all APA sites within a cluster was considered the expression abundance of the APA site. An APA site was assigned to a gene if it was mapped within the region spanning from the 5′-end of the gene to 1,000 bp downstream of the 3′-end of the gene. APA sites mapped to multiple genes were excluded from further analysis. The expression level of an APA site was calculated by dividing the total number of reads mapped to the site by the total number of reads from the corresponding library that were successfully mapped to the genome. The TAPAS [55] pipeline with default parameters was used to predict APA sites from RNA-seq data. For APA sites identified by 3′-end-seq or predicted from RNA-seq, we considered the APA site with the highest expression level in a gene to be the major APA site of the gene, while other APA sites were considered minor APA sites. The total percentage usage of minor APA sites in a gene was calculated by dividing the total expression level of minor APA sites by the total expression level of all APA sites of the gene.
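The filtering, clustering, and minor-site usage steps can be sketched as below. This is an illustration under stated assumptions rather than the authors' script: the function names are hypothetical, the input is a position-to-read-count mapping for one chromosome strand, and "within 30 base pairs of one another" is interpreted as chaining adjacent sites whose gap is at most 30 bp (the text does not specify the exact merging rule).

```python
def cluster_sites(site_counts, min_reads=2, max_gap=30):
    """Cluster sites on one chromosome strand.

    site_counts: dict mapping genomic position -> supporting read count.
    Returns a list of (representative_position, total_reads) tuples.
    """
    # Drop sites supported by fewer than min_reads reads, then sort by position.
    kept = sorted((p, c) for p, c in site_counts.items() if c >= min_reads)
    clusters, current = [], []
    for pos, cnt in kept:
        # Start a new cluster when the gap to the previous site exceeds max_gap
        # (assumed chaining rule).
        if current and pos - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((pos, cnt))
    if current:
        clusters.append(current)
    result = []
    for members in clusters:
        # Representative = site with the most reads (first one wins on ties).
        rep = max(members, key=lambda m: m[1])[0]
        total = sum(c for _, c in members)
        result.append((rep, total))
    return result


def minor_usage(abundances):
    """Fraction of a gene's total site abundance contributed by minor sites."""
    total = sum(abundances)
    if total == 0:
        return None
    return (total - max(abundances)) / total


sites = cluster_sites({100: 5, 110: 20, 125: 3, 200: 1, 400: 8})
usage = minor_usage([reads for _, reads in sites])
```

Within a single library, dividing every site by the same number of mapped reads does not change this ratio, so the sketch works directly on read totals.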

ATI data processing

CAGE-seq data were downloaded from FANTOM5 [87] and NCBI (S1 Data). Raw CAGE-seq reads were inspected using FastQC and adapters were removed using Cutadapt. rRNAdust was then used to filter out rRNA reads. Next, the cleaned CAGE-seq reads were aligned to the genome using HISAT2 [88] with the default setting. Uniquely aligned reads were processed with the CAGEr [89] package and converted to quantified CAGE transcriptional start site (CTSS) coordinates. CTSSs supported by fewer than two CAGE-seq reads were excluded from further analysis. Remaining CTSSs located within 30 base pairs of one another were merged into a single cluster. The CTSS with the highest read count within each cluster was designated as the representative position of the cluster, referred to as an ATI site. The total number of CAGE-seq reads mapped to all CTSSs within a cluster was defined as the expression abundance of the corresponding ATI site. An ATI site was assigned to a gene if it was mapped within the region spanning from 1,000 bp upstream of the 5′-end of the gene to the 3′-end of the gene. ATI sites mapped to multiple genes were excluded from further analysis. The expression level of an ATI site was calculated by dividing the total number of reads mapped to the site by the total number of reads from the corresponding library that were successfully mapped to the genome. The SEASTAR [60] pipeline with default parameters was used to predict ATI sites from RNA-seq data. ATI sites with a first-exon coverage of fewer than two reads were excluded. The expression level of an ATI site was calculated by dividing its first-exon coverage by the product of the length of the exon and the total number of reads mapped to the genome in the library. For ATI sites identified by CAGE-seq or predicted from RNA-seq, the ATI site with the highest expression level for a gene was considered the major ATI site of the gene, while all other ATI sites were considered minor sites. The total percentage usage of minor ATI sites of a gene was computed by dividing the total expression level of minor ATI sites by the total expression level of all ATI sites of the gene.
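The assignment of sites to genes can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and the gene and site record layout are hypothetical, and requiring the site and gene to be on the same strand is an assumption that the text does not state. With `upstream=1000, downstream=0` the window matches the ATI rule above; with `upstream=0, downstream=1000` it matches the APA rule.

```python
def gene_window(gene, upstream, downstream):
    """Return (low, high) genomic bounds of a gene padded relative to its strand.

    gene: dict with 'start' <= 'end' in genomic coordinates and 'strand'.
    """
    if gene["strand"] == "+":
        # On the plus strand the 5' end is at 'start'.
        return gene["start"] - upstream, gene["end"] + downstream
    # On the minus strand the 5' end is at 'end'.
    return gene["start"] - downstream, gene["end"] + upstream


def assign_sites(sites, genes, upstream=1000, downstream=0):
    """Map each (chrom, strand, pos) site to the single gene whose window contains it.

    Sites that fall in no window, or in the windows of several genes, are dropped.
    """
    assigned = {}
    for site in sites:
        chrom, strand, pos = site
        hits = []
        for gene_id, gene in genes.items():
            # Same-strand matching is an assumption of this sketch.
            if gene["chrom"] != chrom or gene["strand"] != strand:
                continue
            low, high = gene_window(gene, upstream, downstream)
            if low <= pos <= high:
                hits.append(gene_id)
        if len(hits) == 1:
            assigned[site] = hits[0]
    return assigned


genes = {
    "g1": {"chrom": "chr1", "strand": "+", "start": 5000, "end": 8000},
    "g2": {"chrom": "chr1", "strand": "+", "start": 8500, "end": 9500},
    "g3": {"chrom": "chr1", "strand": "-", "start": 2000, "end": 3000},
}
sites = [("chr1", "+", 4200), ("chr1", "+", 7800), ("chr1", "-", 3900), ("chr1", "+", 3000)]
ati_assignment = assign_sites(sites, genes)
```

In this toy example the site at 7,800 lies inside g1 and within 1,000 bp upstream of g2, so it is excluded as ambiguous, while the site at 3,000 on the plus strand matches no gene.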

AS data processing

The splicing junctions and expression level of each RNA isoform in a gene were respectively assembled and measured using StringTie [63] after STAR alignment. The isoform with the highest TPM in a gene was regarded as the major RNA transcript of the gene, while other isoforms were considered minor RNA transcripts. For junctions that are shared between multiple transcripts, we reassigned junction reads to these transcripts based on their expression levels. The total percentage usage of junctions in minor RNA splicing isoforms was calculated by dividing the splicing amount in minor RNA splicing isoforms of a gene by the total splicing amount of the gene.
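The reassignment of shared junction reads and the resulting minor-isoform usage can be sketched as below. This is a sketch under assumptions: the function name and inputs are hypothetical, shared junction reads are split in proportion to isoform TPM (the text says only "based on their expression levels"), and the "splicing amount" of an isoform is taken to be the sum of junction reads assigned to it.

```python
def minor_junction_usage(isoform_tpm, junction_reads, junction_isoforms):
    """Fraction of a gene's junction reads attributed to minor isoforms.

    isoform_tpm: dict isoform ID -> TPM.
    junction_reads: dict junction ID -> number of reads spanning the junction.
    junction_isoforms: dict junction ID -> list of isoforms containing it.
    """
    # The major isoform is the one with the highest TPM.
    major = max(isoform_tpm, key=isoform_tpm.get)
    amount = {iso: 0.0 for iso in isoform_tpm}
    for junction, reads in junction_reads.items():
        owners = junction_isoforms[junction]
        weight = sum(isoform_tpm[iso] for iso in owners)
        if weight == 0:
            continue  # no expressed isoform carries this junction
        for iso in owners:
            # Proportional split of shared reads (assumed rule).
            amount[iso] += reads * isoform_tpm[iso] / weight
    total = sum(amount.values())
    if total == 0:
        return None
    return (total - amount[major]) / total


usage = minor_junction_usage(
    {"t1": 30.0, "t2": 10.0},
    {"j1": 40, "j2": 20, "j3": 10},
    {"j1": ["t1", "t2"], "j2": ["t1"], "j3": ["t2"]},
)
```

Here the 40 reads on the shared junction j1 are split 30:10 between t1 and t2, so the major isoform t1 accounts for 50 of 70 junction reads and the minor isoform for the remaining 20.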

Data analysis

Data analysis was conducted using R statistical software (v4.2). To take phylogenetic inertia into account, we performed PGLS regression [36] using the R package “caper” when conducting cross-species correlations.
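The idea behind PGLS can be illustrated with a generalized least squares fit in which the residual covariance among species comes from the phylogeny. The sketch below is not the caper implementation used in the study: it is written in plain Python, it assumes a Brownian-motion covariance matrix supplied by the caller (each entry being the shared branch length between two species), and all function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gls_fit(x, y, cov):
    """Return (intercept, slope) of y ~ x under residual covariance cov.

    Implements beta = (X' V^-1 X)^-1 X' V^-1 y with X = [1, x].
    """
    ones = [1.0] * len(x)
    vinv_ones = solve(cov, ones)  # V^-1 * 1
    vinv_x = solve(cov, x)        # V^-1 * x
    s11 = dot(ones, vinv_ones)
    s12 = dot(ones, vinv_x)
    s22 = dot(x, vinv_x)
    t1 = dot(vinv_ones, y)        # 1' V^-1 y (V is symmetric)
    t2 = dot(vinv_x, y)           # x' V^-1 y
    det = s11 * s22 - s12 * s12
    slope = (s11 * t2 - s12 * t1) / det
    intercept = (s22 * t1 - s12 * t2) / det
    return intercept, slope


# Toy covariance for the tree ((A,B),(C,D)): sister species share half their history.
cov = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
intercept, slope = gls_fit([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0], cov)
```

With an identity covariance matrix the same function reduces to ordinary least squares, which is a convenient sanity check on the weighting.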

Supporting information

S1 Fig. Analysis of transcript diversity caused by APA (measured by total percentage usage of minor APA sites) using 3′-end-seq and RNA-seq.

(A) Correlation between N_e and transcript diversity caused by APA estimated from 3′-end-seq across seven metazoans. All protein-coding genes are used in the analysis. (B) Spearman’s correlation between the expression level of an APA site quantified by 3′-end-seq and that predicted by RNA-seq. (C) Spearman’s correlation between the total percentage usage of minor APA sites in a gene quantified by 3′-end-seq and that predicted by RNA-seq. The X-axes in (B) and (C) show the SRA accession numbers of RNA-seq and 3′-end-seq data from the same sample. (D) Spearman’s correlation between transcript diversity caused by APA in a species quantified by 3′-end-seq and that predicted by RNA-seq. (E) Spearman’s correlation between the gene expression level and the total percentage usage of minor APA sites across genes in each of three tissues in each of 75 species. Each row represents a species. (F) Correlation between transcript diversity caused by APA and N_e, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (F). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s001

(TIF)

S2 Fig. Analysis of transcript diversity caused by ATI (measured by the total percentage usage of minor ATI sites) using CAGE-seq and RNA-seq.

(A) Spearman’s correlation between the expression level of an ATI site quantified by CAGE-seq and that predicted by RNA-seq. (B) Spearman’s correlation between the total percentage usage of minor ATI sites in a gene quantified by CAGE-seq and that predicted by RNA-seq. The X-axes in (A) and (B) show the SRA accession numbers of RNA-seq and CAGE-seq data from the same sample. (C) Spearman’s correlation between transcript diversity caused by ATI in a species quantified by CAGE-seq and that predicted by RNA-seq. Each dot is a species. (D) Spearman’s correlation between the gene expression level and the total percentage usage of minor ATI sites across genes in each of three tissues in each of 75 species. Each row represents a species. (E) Correlation between transcript diversity caused by ATI and N_e, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (E). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s002

(TIF)

S3 Fig. Analysis of transcript diversity caused by AS (measured by the total percentage usage of splicing junctions in minor RNA splicing isoforms).

(A) Spearman’s correlation between the gene expression level and the total percentage usage of splicing junctions in minor RNA splicing isoforms across genes in each of three tissues in each of 75 species. Each row represents a species. (B) Correlation between transcript diversity caused by AS and N_e, life span, body length, or ω across species in three tissues. All protein-coding genes are used in (B). The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s003

(TIF)

S4 Fig. Correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after removing 9 model species (Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, Drosophila melanogaster, Aedes aegypti, and Caenorhabditis elegans).

The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s004

(TIF)

S5 Fig. Partial correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after the control for genome size and complexity.

Dots represent the raw data. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s005

(TIF)

S6 Fig. Partial correlation between transcript diversity caused by AS and the number of cell types after the control for ω.

Dots represent the raw data. The data underlying this Figure can be found in https://doi.org/10.5281/zenodo.18514977.

https://doi.org/10.1371/journal.pbio.3003671.s006

(TIF)

S1 Data. Available datasets of transcriptomes and genomes for the species investigated.

https://doi.org/10.1371/journal.pbio.3003671.s007

(XLSX)

S2 Data. N_e of 26 metazoans reported in previous studies.

https://doi.org/10.1371/journal.pbio.3003671.s008

(DOCX)

S3 Data. N_e and proxies.

https://doi.org/10.1371/journal.pbio.3003671.s009

(XLSX)

S4 Data. Twelve species with available information about the number of cell types.

https://doi.org/10.1371/journal.pbio.3003671.s010

(XLSX)

Acknowledgments

We thank Michael Lynch for valuable comments.

References

1. Alfonso-Gonzalez C, Hilgers V. (Alternative) transcription start sites as regulators of RNA processing. Trends Cell Biol. 2024;34(12):1018–28. pmid:38531762
2. Elkon R, Ugalde AP, Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14(7):496–506. pmid:23774734
3. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet. 2010;11(5):345–55. pmid:20376054
4. de Klerk E, ’t Hoen PAC. Alternative mRNA transcription, processing, and translation: insights from RNA sequencing. Trends Genet. 2015;31(3):128–39. pmid:25648499
5. Licatalosi DD, Darnell RB. RNA processing and its regulation: global insights into biological networks. Nat Rev Genet. 2010;11(1):75–87. pmid:20019688
6. Dang TTV, Colin J, Janbon G. Alternative transcription start site usage and functional implications in pathogenic fungi. J Fungi (Basel). 2022;8(10):1044. pmid:36294609
7. Fang S, Hou X, Qiu K, He R, Feng X, Liang X. The occurrence and function of alternative splicing in fungi. Fungal Biol Rev. 2020;34(4):178–88.
8. Liu X, Hoque M, Larochelle M, Lemay J-F, Yurko N, Manley JL, et al. Comparative analysis of alternative polyadenylation in S. cerevisiae and S. pombe. Genome Res. 2017;27(10):1685–95. pmid:28916539
9. Xing D, Li QQ. Alternative polyadenylation and gene expression regulation in plants. Wiley Interdiscip Rev RNA. 2011;2(3):445–58. pmid:21957029
10. Syed NH, Kalyna M, Marquez Y, Barta A, Brown JWS. Alternative splicing in plants – coming of age. Trends Plant Sci. 2012;17(10):616–23. pmid:22743067
11. Le NT, Harukawa Y, Miura S, Boer D, Kawabe A, Saze H. Epigenetic regulation of spurious transcription initiation in Arabidopsis. Nat Commun. 2020;11(1):3224. pmid:32591528
12. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, et al. A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462–70. pmid:24670764
13. Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012;22(6):1173–83. pmid:22454233
14. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338(6114):1587–93. pmid:23258890
15. Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. 2006;16(1):55–65. pmid:16344560
16. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–5. pmid:18978789
17. Shephard EA, Chandan P, Stevanovic-Walker M, Edwards M, Phillips IR. Alternative promoters and repetitive DNA elements define the species-dependent tissue-specific expression of the FMO1 genes of human and mouse. Biochem J. 2007;406(3):491–9. pmid:17547558
18. Lianoglou S, Garg V, Yang JL, Leslie CS, Mayr C. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 2013;27(21):2380–96. pmid:24145798
19. Baralle FE, Giudice J. Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol. 2017;18(7):437–51. pmid:28488700
20. Davis W Jr, Schultz RM. Developmental change in TATA-box utilization during preimplantation mouse development. Dev Biol. 2000;218(2):275–83. pmid:10656769
21. Ulitsky I, Shkumatava A, Jan CH, Subtelny AO, Koppstein D, Bell GW, et al. Extensive alternative polyadenylation during zebrafish development. Genome Res. 2012;22(10):2054–66. pmid:22722342
22. Pozner A, Lotem J, Xiao C, Goldenberg D, Brenner O, Negreanu V, et al. Developmentally regulated promoter-switch transcriptionally controls Runx1 function during embryonic hematopoiesis. BMC Dev Biol. 2007;7:84. pmid:17626615
23. Cheng LC, Zheng D, Baljinnyam E, Sun F, Ogami K, Yeung PL, et al. Widespread transcript shortening through alternative polyadenylation in secretory cell differentiation. Nat Commun. 2020;11(1):3182. pmid:32576858
24. Fiszbein A, Kornblihtt AR. Alternative splicing switches: important players in cell differentiation. Bioessays. 2017;39(6):10.1002/bies.201600157. pmid:28452057
25. Tazi J, Bakkour N, Stamm S. Alternative splicing and disease. Biochim Biophys Acta. 2009;1792(1):14–26. pmid:18992329
26. Gruber AJ, Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nat Rev Genet. 2019;20(10):599–614. pmid:31267064
27. Arce L, Yokoyama NN, Waterman ML. Diversity of LEF/TCF action in development and disease. Oncogene. 2006;25(57):7492–504. pmid:17143293
28. Peterson ML. Mechanisms controlling production of membrane and secreted immunoglobulin during B cell development. Immunol Res. 2007;37(1):33–46. pmid:17496345
29. Salz HK, Erickson JW. Sex determination in Drosophila: the view from the top. Fly (Austin). 2010;4(1):60–70. pmid:20160499
30. Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH-M. The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 2008;24(4):167–77. pmid:18329129
31. Mayr C. Evolution and biological roles of alternative 3’UTRs. Trends Cell Biol. 2016;26(3):227–37. pmid:26597575
32. Xu C, Zhang J. Alternative polyadenylation of mammalian transcripts is generally deleterious, not adaptive. Cell Syst. 2018;6(6):734-742.e4. pmid:29886108
33. Xu C, Park J-K, Zhang J. Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biol. 2019;17(3):e3000197. pmid:30883542
34. Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, et al. The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome Biol. 2017;18(1):208. pmid:29084568
35. Xu C, Zhang J. Mammalian circular RNAs result largely from splicing errors. Cell Rep. 2021;36(4):109439. pmid:34320353
36. Bénitière F, Necsulea A, Duret L. Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Elife. 2024;13:RP93629. pmid:38470242
37. Li Z, Mi K, Xu C. Most m5C Modifications in Mammalian mRNAs are Nonadaptive. Mol Biol Evol. 2025;42(1):msaf008. pmid:39824217
38. Li Z, Sarker B, Zhao F, Zhou T, Zhang J, Xu C. COL: a method for identifying putatively functional circular RNAs. J Genet Genomics. 2024;51(11):1338–41. pmid:39218058
39. Zhang J, Xu C. Gene product diversity: adaptive or not? Trends Genet. 2022;38(11):1112–22. pmid:35641344
40. Melamud E, Moult J. Stochastic noise in splicing machinery. Nucleic Acids Res. 2009;37(14):4873–86. pmid:19546110
41. Ohta T. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 1992;23(1):263–86.
42. Lynch M, Ackerman MS, Gout J-F, Long H, Sung W, Thomas WK, et al. Genetic drift, selection and the evolution of the mutation rate. Nat Rev Genet. 2016;17(11):704–14. pmid:27739533
43. Zhang H, Wang Y, Wu X, Tang X, Wu C, Lu J. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun. 2021;12(1):1076. pmid:33597535
44. Figuet E, Nabholz B, Bonneau M, Mas Carrio E, Nadachowska-Brzyska K, Ellegren H, et al. Life history traits, protein evolution, and the nearly neutral theory in amniotes. Mol Biol Evol. 2016;33(6):1517–27. pmid:26944704
45. Waples RS. Life-history traits and effective population size in species with overlapping generations revisited: the importance of adult mortality. Heredity (Edinb). 2016;117(4):241–50. pmid:27273324
46. Zhang J, Qian W. Functional synonymous mutations and their evolutionary consequences. Nat Rev Genet. 2025;26(11):789–804. pmid:40394196
47. Alvarez-Ponce D, Krishnamurthy S. Organismal complexity strongly correlates with the number of protein families and domains. Proc Natl Acad Sci U S A. 2025;122(5):e2404332122. pmid:39874285
48. Chen W, Jia Q, Song Y, Fu H, Wei G, Ni T. Alternative polyadenylation: methods, findings, and impacts. Genomics Proteomics Bioinformatics. 2017;15(5):287–300. pmid:29031844
49. Wu G, Schmid M, Jensen TH. 3’ End sequencing of pA+ and pA- RNAs. Methods Enzymol. 2021;655:139–64. pmid:34183119
50. Hoque M, Ji Z, Zheng D, Luo W, Li W, You B, et al. Analysis of alternative cleavage and polyadenylation by 3’ region extraction and deep sequencing. Nat Methods. 2013;10(2):133–9. pmid:23241633
51. Jan CH, Friedman RC, Ruby JG, Bartel DP. Formation, regulation and evolution of Caenorhabditis elegans 3’UTRs. Nature. 2011;469(7328):97–101. pmid:21085120
52. Shepard PJ, Choi E-A, Lu J, Flanagan LA, Hertel KJ, Shi Y. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA. 2011;17(4):761–72. pmid:21343387
53. Herrmann CJ, Schmidt R, Kanitz A, Artimo P, Gruber AJ, Zavolan M. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3’ end sequencing. Nucleic Acids Res. 2020;48(D1):D174–9. pmid:31617559
54. Ye W, Lian Q, Ye C, Wu X. A survey on methods for predicting polyadenylation sites from DNA sequences, bulk RNA-seq, and single-cell RNA-seq. Genomics Proteomics Bioinformatics. 2023;21(1):67–83. pmid:36167284
55. Arefeen A, Liu J, Xiao X, Jiang T. TAPAS: tool for alternative polyadenylation site analysis. Bioinformatics. 2018;34(15):2521–9. pmid:30052912
56. Chen M, Ji G, Fu H, Lin Q, Ye C, Ye W, et al. A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data. Brief Bioinform. 2020;21(4):1261–76. pmid:31267126
57. Tang P, Yang Y, Li G, Huang L, Wen M, Ruan W, et al. Alternative polyadenylation by sequential activation of distal and proximal PolyA sites. Nat Struct Mol Biol. 2022;29(1):21–31. pmid:35013598
58. Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5’-end RNA-sequencing methods. Nat Methods. 2018;15(7):505–11. pmid:29867192
59. Cass AA, Xiao X. mountainClimber Identifies Alternative Transcription Start and Polyadenylation Sites in RNA-Seq. Cell Syst. 2019;9(4):393-400.e6. pmid:31542416
60. Qin Z, Stoilov P, Zhang X, Xing Y. SEASTAR: systematic evaluation of alternative transcription start sites in RNA. Nucleic Acids Res. 2018;46(8):e45. pmid:29546410
61. Gacita AM, Dellefave-Castillo L, Page PGT, Barefield DY, Wasserstrom JA, Puckelwartz MJ, et al. Altered enhancer and promoter usage leads to differential gene expression in the normal and failed human heart. Circ Heart Fail. 2020;13(10):e006926. pmid:32993371
62. Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 2010;6(12):e1001236. pmid:21151575
63. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5. pmid:25690850
64. Zhang J, Yang J-R. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 2015;16(7):409–20. pmid:26055156
65. Xu C, Zhang J. A different perspective on alternative cleavage and polyadenylation. Nat Rev Genet. 2020;21(1):63. pmid:31745293
66. Lynch M. Evolution of the mutation rate. Trends Genet. 2010;26(8):345–52. pmid:20594608
67. Schad E, Tompa P, Hegyi H. The relationship between proteome size, structural disorder and organism complexity. Genome Biol. 2011;12(12):R120. pmid:22182830
68. Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Mol Biol Evol. 2014;31(6):1402–13. pmid:24682283
69. Xu G, Zhang J. Human coding RNA editing is generally nonadaptive. Proc Natl Acad Sci U S A. 2014;111(10):3769–74. pmid:24567376
70. Xu C, Zhang J. Mammalian alternative translation initiation is mostly nonadaptive. Mol Biol Evol. 2020;37(7):2015–28. pmid:32145028
71. Sun M, Zhang J. Preferred synonymous codons are translated more accurately: Proteomic evidence, among-species variation, and mechanistic basis. Sci Adv. 2022;8(27):eabl9812. pmid:35857447
72. Li C, Zhang J. Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLoS Genet. 2019;15(5):e1008141. pmid:31120886
73. Li W, Baehr S, Marasco M, Reyes L, Brister D, Pikaard CS, et al. A narrow range of transcript-error rates across the Tree of Life. Sci Adv. 2025;11(28):eadv9898. pmid:40644547
74. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–95. pmid:34791404
75. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021;49(D1):D10–7. pmid:33095870
76. Hinchliff CE, Smith SA, Allman JF, Burleigh JG, Chaudhary R, Coghill LM, et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci U S A. 2015;112(41):12764–9. pmid:26385966
77. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7. pmid:15034147
78. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. pmid:19505945
79. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16(1):157. pmid:26243257
80. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276–7. pmid:10827456
81. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. pmid:17483113
82. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–90. pmid:30423086
83. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886
84. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30. pmid:24227677
85. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5. pmid:22872506
86. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. pmid:22388286
87. Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015;16(1):22. pmid:25723102
88. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60. pmid:25751142
89. Haberle V, Forrest ARR, Hayashizaki Y, Carninci P, Lenhard B. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015;43(8):e51. pmid:25653163

[ref1] 1. Alfonso-Gonzalez C, Hilgers V. (Alternative) transcription start sites as regulators of RNA processing. Trends Cell Biol. 2024;34(12):1018–28. pmid:38531762
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Elkon R, Ugalde AP, Agami R. Alternative cleavage and polyadenylation: extent, regulation and function. Nat Rev Genet. 2013;14(7):496–506. pmid:23774734
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Keren H, Lev-Maor G, Ast G. Alternative splicing and evolution: diversification, exon definition and function. Nat Rev Genet. 2010;11(5):345–55. pmid:20376054
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. de Klerk E, ’t Hoen PAC. Alternative mRNA transcription, processing, and translation: insights from RNA sequencing. Trends Genet. 2015;31(3):128–39. pmid:25648499
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Licatalosi DD, Darnell RB. RNA processing and its regulation: global insights into biological networks. Nat Rev Genet. 2010;11(1):75–87. pmid:20019688
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Dang TTV, Colin J, Janbon G. Alternative transcription start site usage and functional implications in pathogenic fungi. J Fungi (Basel). 2022;8(10):1044. pmid:36294609
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Fang S, Hou X, Qiu K, He R, Feng X, Liang X. The occurrence and function of alternative splicing in fungi. Fungal Biol Rev. 2020;34(4):178–88.
View Article
Google Scholar

[26] View Article

[27] Google Scholar

[ref8] 8. Liu X, Hoque M, Larochelle M, Lemay J-F, Yurko N, Manley JL, et al. Comparative analysis of alternative polyadenylation in S. cerevisiae and S. pombe. Genome Res. 2017;27(10):1685–95. pmid:28916539
View Article
PubMed/NCBI
Google Scholar

[29] View Article

[30] PubMed/NCBI

[31] Google Scholar

[ref9] 9. Xing D, Li QQ. Alternative polyadenylation and gene expression regulation in plants. Wiley Interdiscip Rev RNA. 2011;2(3):445–58. pmid:21957029
View Article
PubMed/NCBI
Google Scholar

[33] View Article

[34] PubMed/NCBI

[35] Google Scholar

[ref10] 10. Syed NH, Kalyna M, Marquez Y, Barta A, Brown JWS. Alternative splicing in plants – coming of age. Trends Plant Sci. 2012;17(10):616–23. pmid:22743067
View Article
PubMed/NCBI
Google Scholar

[37] View Article

[38] PubMed/NCBI

[39] Google Scholar

[ref11] 11. Le NT, Harukawa Y, Miura S, Boer D, Kawabe A, Saze H. Epigenetic regulation of spurious transcription initiation in Arabidopsis. Nat Commun. 2020;11(1):3224. pmid:32591528
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref12] 12. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, et al. A promoter-level mammalian expression atlas. Nature. 2014;507(7493):462–70. pmid:24670764

[ref13] 13. Derti A, Garrett-Engele P, Macisaac KD, Stevens RC, Sriram S, Chen R, et al. A quantitative atlas of polyadenylation in five mammals. Genome Res. 2012;22(6):1173–83. pmid:22454233

[ref14] 14. Barbosa-Morais NL, Irimia M, Pan Q, Xiong HY, Gueroussov S, Lee LJ, et al. The evolutionary landscape of alternative splicing in vertebrate species. Science. 2012;338(6114):1587–93. pmid:23258890

[ref15] 15. Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, et al. Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res. 2006;16(1):55–65. pmid:16344560

[ref16] 16. Pan Q, Shai O, Lee LJ, Frey BJ, Blencowe BJ. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat Genet. 2008;40(12):1413–5. pmid:18978789

[ref17] 17. Shephard EA, Chandan P, Stevanovic-Walker M, Edwards M, Phillips IR. Alternative promoters and repetitive DNA elements define the species-dependent tissue-specific expression of the FMO1 genes of human and mouse. Biochem J. 2007;406(3):491–9. pmid:17547558

[ref18] 18. Lianoglou S, Garg V, Yang JL, Leslie CS, Mayr C. Ubiquitously transcribed genes use alternative polyadenylation to achieve tissue-specific expression. Genes Dev. 2013;27(21):2380–96. pmid:24145798

[ref19] 19. Baralle FE, Giudice J. Alternative splicing as a regulator of development and tissue identity. Nat Rev Mol Cell Biol. 2017;18(7):437–51. pmid:28488700

[ref20] 20. Davis W Jr, Schultz RM. Developmental change in TATA-box utilization during preimplantation mouse development. Dev Biol. 2000;218(2):275–83. pmid:10656769

[ref21] 21. Ulitsky I, Shkumatava A, Jan CH, Subtelny AO, Koppstein D, Bell GW, et al. Extensive alternative polyadenylation during zebrafish development. Genome Res. 2012;22(10):2054–66. pmid:22722342

[ref22] 22. Pozner A, Lotem J, Xiao C, Goldenberg D, Brenner O, Negreanu V, et al. Developmentally regulated promoter-switch transcriptionally controls Runx1 function during embryonic hematopoiesis. BMC Dev Biol. 2007;7:84. pmid:17626615

[ref23] 23. Cheng LC, Zheng D, Baljinnyam E, Sun F, Ogami K, Yeung PL, et al. Widespread transcript shortening through alternative polyadenylation in secretory cell differentiation. Nat Commun. 2020;11(1):3182. pmid:32576858

[ref24] 24. Fiszbein A, Kornblihtt AR. Alternative splicing switches: important players in cell differentiation. Bioessays. 2017;39(6):10.1002/bies.201600157. pmid:28452057

[ref25] 25. Tazi J, Bakkour N, Stamm S. Alternative splicing and disease. Biochim Biophys Acta. 2009;1792(1):14–26. pmid:18992329

[ref26] 26. Gruber AJ, Zavolan M. Alternative cleavage and polyadenylation in health and disease. Nat Rev Genet. 2019;20(10):599–614. pmid:31267064

[ref27] 27. Arce L, Yokoyama NN, Waterman ML. Diversity of LEF/TCF action in development and disease. Oncogene. 2006;25(57):7492–504. pmid:17143293

[ref28] 28. Peterson ML. Mechanisms controlling production of membrane and secreted immunoglobulin during B cell development. Immunol Res. 2007;37(1):33–46. pmid:17496345

[ref29] 29. Salz HK, Erickson JW. Sex determination in Drosophila: the view from the top. Fly (Austin). 2010;4(1):60–70. pmid:20160499

[ref30] 30. Davuluri RV, Suzuki Y, Sugano S, Plass C, Huang TH-M. The functional consequences of alternative promoter use in mammalian genomes. Trends Genet. 2008;24(4):167–77. pmid:18329129

[ref31] 31. Mayr C. Evolution and biological roles of alternative 3’UTRs. Trends Cell Biol. 2016;26(3):227–37. pmid:26597575

[ref32] 32. Xu C, Zhang J. Alternative polyadenylation of mammalian transcripts is generally deleterious, not adaptive. Cell Syst. 2018;6(6):734-742.e4. pmid:29886108

[ref33] 33. Xu C, Park J-K, Zhang J. Evidence that alternative transcriptional initiation is largely nonadaptive. PLoS Biol. 2019;17(3):e3000197. pmid:30883542

[ref34] 34. Saudemont B, Popa A, Parmley JL, Rocher V, Blugeon C, Necsulea A, et al. The fitness cost of mis-splicing is the main determinant of alternative splicing patterns. Genome Biol. 2017;18(1):208. pmid:29084568

[ref35] 35. Xu C, Zhang J. Mammalian circular RNAs result largely from splicing errors. Cell Rep. 2021;36(4):109439. pmid:34320353

[ref36] 36. Bénitière F, Necsulea A, Duret L. Random genetic drift sets an upper limit on mRNA splicing accuracy in metazoans. Elife. 2024;13:RP93629. pmid:38470242

[ref37] 37. Li Z, Mi K, Xu C. Most m5C Modifications in Mammalian mRNAs are Nonadaptive. Mol Biol Evol. 2025;42(1):msaf008. pmid:39824217

[ref38] 38. Li Z, Sarker B, Zhao F, Zhou T, Zhang J, Xu C. COL: a method for identifying putatively functional circular RNAs. J Genet Genomics. 2024;51(11):1338–41. pmid:39218058

[ref39] 39. Zhang J, Xu C. Gene product diversity: adaptive or not? Trends Genet. 2022;38(11):1112–22. pmid:35641344

[ref40] 40. Melamud E, Moult J. Stochastic noise in splicing machinery. Nucleic Acids Res. 2009;37(14):4873–86. pmid:19546110

[ref41] 41. Ohta T. The nearly neutral theory of molecular evolution. Annu Rev Ecol Syst. 1992;23(1):263–86.

[ref42] 42. Lynch M, Ackerman MS, Gout J-F, Long H, Sung W, Thomas WK, et al. Genetic drift, selection and the evolution of the mutation rate. Nat Rev Genet. 2016;17(11):704–14. pmid:27739533

[ref43] 43. Zhang H, Wang Y, Wu X, Tang X, Wu C, Lu J. Determinants of genome-wide distribution and evolution of uORFs in eukaryotes. Nat Commun. 2021;12(1):1076. pmid:33597535

[ref44] 44. Figuet E, Nabholz B, Bonneau M, Mas Carrio E, Nadachowska-Brzyska K, Ellegren H, et al. Life history traits, protein evolution, and the nearly neutral theory in amniotes. Mol Biol Evol. 2016;33(6):1517–27. pmid:26944704

[ref45] 45. Waples RS. Life-history traits and effective population size in species with overlapping generations revisited: the importance of adult mortality. Heredity (Edinb). 2016;117(4):241–50. pmid:27273324

[ref46] 46. Zhang J, Qian W. Functional synonymous mutations and their evolutionary consequences. Nat Rev Genet. 2025;26(11):789–804. pmid:40394196

[ref47] 47. Alvarez-Ponce D, Krishnamurthy S. Organismal complexity strongly correlates with the number of protein families and domains. Proc Natl Acad Sci U S A. 2025;122(5):e2404332122. pmid:39874285

[ref48] 48. Chen W, Jia Q, Song Y, Fu H, Wei G, Ni T. Alternative polyadenylation: methods, findings, and impacts. Genomics Proteomics Bioinformatics. 2017;15(5):287–300. pmid:29031844

[ref49] 49. Wu G, Schmid M, Jensen TH. 3’ End sequencing of pA+ and pA- RNAs. Methods Enzymol. 2021;655:139–64. pmid:34183119

[ref50] 50. Hoque M, Ji Z, Zheng D, Luo W, Li W, You B, et al. Analysis of alternative cleavage and polyadenylation by 3’ region extraction and deep sequencing. Nat Methods. 2013;10(2):133–9. pmid:23241633

[ref51] 51. Jan CH, Friedman RC, Ruby JG, Bartel DP. Formation, regulation and evolution of Caenorhabditis elegans 3’UTRs. Nature. 2011;469(7328):97–101. pmid:21085120

[ref52] 52. Shepard PJ, Choi E-A, Lu J, Flanagan LA, Hertel KJ, Shi Y. Complex and dynamic landscape of RNA polyadenylation revealed by PAS-Seq. RNA. 2011;17(4):761–72. pmid:21343387

[ref53] 53. Herrmann CJ, Schmidt R, Kanitz A, Artimo P, Gruber AJ, Zavolan M. PolyASite 2.0: a consolidated atlas of polyadenylation sites from 3’ end sequencing. Nucleic Acids Res. 2020;48(D1):D174–9. pmid:31617559

[ref54] 54. Ye W, Lian Q, Ye C, Wu X. A survey on methods for predicting polyadenylation sites from DNA sequences, bulk RNA-seq, and single-cell RNA-seq. Genomics Proteomics Bioinformatics. 2023;21(1):67–83. pmid:36167284

[ref55] 55. Arefeen A, Liu J, Xiao X, Jiang T. TAPAS: tool for alternative polyadenylation site analysis. Bioinformatics. 2018;34(15):2521–9. pmid:30052912

[ref56] 56. Chen M, Ji G, Fu H, Lin Q, Ye C, Ye W, et al. A survey on identification and quantification of alternative polyadenylation sites from RNA-seq data. Brief Bioinform. 2020;21(4):1261–76. pmid:31267126

[ref57] 57. Tang P, Yang Y, Li G, Huang L, Wen M, Ruan W, et al. Alternative polyadenylation by sequential activation of distal and proximal PolyA sites. Nat Struct Mol Biol. 2022;29(1):21–31. pmid:35013598

[ref58] 58. Adiconis X, Haber AL, Simmons SK, Levy Moonshine A, Ji Z, Busby MA, et al. Comprehensive comparative analysis of 5’-end RNA-sequencing methods. Nat Methods. 2018;15(7):505–11. pmid:29867192

[ref59] 59. Cass AA, Xiao X. mountainClimber Identifies Alternative Transcription Start and Polyadenylation Sites in RNA-Seq. Cell Syst. 2019;9(4):393-400.e6. pmid:31542416

[ref60] 60. Qin Z, Stoilov P, Zhang X, Xing Y. SEASTAR: systematic evaluation of alternative transcription start sites in RNA. Nucleic Acids Res. 2018;46(8):e45. pmid:29546410

[ref61] 61. Gacita AM, Dellefave-Castillo L, Page PGT, Barefield DY, Wasserstrom JA, Puckelwartz MJ, et al. Altered enhancer and promoter usage leads to differential gene expression in the normal and failed human heart. Circ Heart Fail. 2020;13(10):e006926. pmid:32993371

[ref62] 62. Pickrell JK, Pai AA, Gilad Y, Pritchard JK. Noisy splicing drives mRNA isoform diversity in human cells. PLoS Genet. 2010;6(12):e1001236. pmid:21151575

[ref63] 63. Pertea M, Pertea GM, Antonescu CM, Chang T-C, Mendell JT, Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol. 2015;33(3):290–5. pmid:25690850

[ref64] 64. Zhang J, Yang J-R. Determinants of the rate of protein sequence evolution. Nat Rev Genet. 2015;16(7):409–20. pmid:26055156

[ref65] 65. Xu C, Zhang J. A different perspective on alternative cleavage and polyadenylation. Nat Rev Genet. 2020;21(1):63. pmid:31745293

[ref66] 66. Lynch M. Evolution of the mutation rate. Trends Genet. 2010;26(8):345–52. pmid:20594608

[ref67] 67. Schad E, Tompa P, Hegyi H. The relationship between proteome size, structural disorder and organism complexity. Genome Biol. 2011;12(12):R120. pmid:22182830

[ref68] 68. Chen L, Bush SJ, Tovar-Corona JM, Castillo-Morales A, Urrutia AO. Correcting for differential transcript coverage reveals a strong relationship between alternative splicing and organism complexity. Mol Biol Evol. 2014;31(6):1402–13. pmid:24682283

[ref69] 69. Xu G, Zhang J. Human coding RNA editing is generally nonadaptive. Proc Natl Acad Sci U S A. 2014;111(10):3769–74. pmid:24567376

[ref70] 70. Xu C, Zhang J. Mammalian alternative translation initiation is mostly nonadaptive. Mol Biol Evol. 2020;37(7):2015–28. pmid:32145028

[ref71] 71. Sun M, Zhang J. Preferred synonymous codons are translated more accurately: Proteomic evidence, among-species variation, and mechanistic basis. Sci Adv. 2022;8(27):eabl9812. pmid:35857447

[ref72] 72. Li C, Zhang J. Stop-codon read-through arises largely from molecular errors and is generally nonadaptive. PLoS Genet. 2019;15(5):e1008141. pmid:31120886

[ref73] 73. Li W, Baehr S, Marasco M, Reyes L, Brister D, Pikaard CS, et al. A narrow range of transcript-error rates across the Tree of Life. Sci Adv. 2025;11(28):eadv9898. pmid:40644547

[ref74] 74. Cunningham F, Allen JE, Allen J, Alvarez-Jarreta J, Amode MR, Armean IM, et al. Ensembl 2022. Nucleic Acids Res. 2022;50(D1):D988–95. pmid:34791404

[ref75] 75. Sayers EW, Beck J, Bolton EE, Bourexis D, Brister JR, Canese K, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2021;49(D1):D10–7. pmid:33095870

[ref76] 76. Hinchliff CE, Smith SA, Allman JF, Burleigh JG, Chaudhary R, Coghill LM, et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proc Natl Acad Sci U S A. 2015;112(41):12764–9. pmid:26385966

[ref77] 77. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32(5):1792–7. pmid:15034147

[ref78] 78. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25(15):1972–3. pmid:19505945

[ref79] 79. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16(1):157. pmid:26243257

[ref80] 80. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000;16(6):276–7. pmid:10827456

[ref81] 81. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91. pmid:17483113

[ref82] 82. Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884–90. pmid:30423086

[ref83] 83. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29(1):15–21. pmid:23104886

[ref84] 84. Liao Y, Smyth GK, Shi W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics. 2014;30(7):923–30. pmid:24227677

[ref85] 85. Wagner GP, Kin K, Lynch VJ. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 2012;131(4):281–5. pmid:22872506

[ref86] 86. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357–9. pmid:22388286

[ref87] 87. Lizio M, Harshbarger J, Shimoji H, Severin J, Kasukawa T, Sahin S, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol. 2015;16(1):22. pmid:25723102

[ref88] 88. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12(4):357–60. pmid:25751142

[ref89] 89. Haberle V, Forrest ARR, Hayashizaki Y, Carninci P, Lenhard B. CAGEr: precise TSS data retrieval and high-resolution promoterome mining for integrative analyses. Nucleic Acids Res. 2015;43(8):e51. pmid:25653163


Supporting information

S1 Fig. Analysis of transcript diversity caused by APA (measured by total percentage usage of minor APA sites) using 3′-end-seq and RNA-seq.

S2 Fig. Analysis of transcript diversity caused by ATI (measured by the total percentage usage of minor ATI sites) using CAGE-seq and RNA-seq.

S3 Fig. Analysis of transcript diversity caused by AS (measured by the total percentage usage of splicing junctions in minor RNA splicing isoforms).

S4 Fig. Correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after removing 9 model species (Homo sapiens, Macaca mulatta, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, Drosophila melanogaster, Aedes aegypti, and Caenorhabditis elegans).

S5 Fig. Partial correlation between transcript diversity caused by APA (A), ATI (B), and AS (C) and ω after the control for genome size and complexity.

S6 Fig. Partial correlation between transcript diversity caused by AS and the number of cell types after the control for ω.

S1 Data. Available datasets of transcriptomes and genomes for the species investigated.

S2 Data. N_e of 26 metazoans reported in previous studies.

S3 Data. N_e and proxies.

S4 Data. Twelve species with available information about the number of cell types.
