A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias

We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely rare. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0-s of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0-s do not show the expected association with N, suggesting widespread problems with false discovery rate (FDR) control. Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.


Introduction
Over the past decade, a feeling that there is a crisis in experimental science has increasingly permeated the thinking of methodologists, captains of industry, working scientists, and even the lay public [1][2][3][4][5][6]. This manifests in poor statistical power to find true effects [7,8].

Multinomial hierarchical logistic regression indicates that most differential expression (DE) analysis tools exhibit temporal increases in the fraction of anti-conservative p-value histograms, except for cuffdiff, which has the opposite trend, and clc genomics and deseq, for which a clear trend could not be identified (Figs 2A and S3). The increase in the fraction of anti-conservative histograms is accompanied by decreases mainly in the classes "other" and "bimodal," depending on the DE analysis tool. This positive temporal trend in anti-conservative p-value histograms suggests improving quality in the HT-seq DE field. Somewhat surprisingly, Fig 2A also indicates that different DE analysis tools are associated with very different proportions of p-value histogram classes, suggesting that the quality of p-value calculation, and therefore the quality of scientific inferences based on the p-values, depends on the DE analysis tool. We further tested this conjecture in a simplified model, restricting our analysis to 2018 to 2020, the final years in our dataset (Fig 2B). As no single DE analysis tool dominates the field (the top 5 are DESeq2 28%, cuffdiff 27%, edgeR 14%, DESeq 8%, limma 2%; see S4 Fig for temporal trends), a situation where proportions of different p-value histogram classes do not substantially differ between analysis tools would indicate a lack of tool-generated bias. However, we found that all p-value histogram classes, except "uniform," which is largely unpopulated, depend strongly on the DE analysis tool (Figs 2B, S3, and S5). This effect is quite extreme, with fractions extending from nearly zero for cuffdiff to about 0.75 for Sleuth.
Using the whole dataset of 14,813 p-value histograms (as a check for the robustness of results), or adjusting the analysis for GEO publication year, taxon (human, mouse, and pooled other), RNA source, or sequencing platform (as a check for possible confounding), does not change this conclusion (S5B–S5F Fig). The lack of confounding in our results allows a causal interpretation, indicating that DE analysis tools bias the analysis of HT-seq experiments, while the large DE analysis platform-dependent differences suggest a very substantial bias [28].
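The classification of p-value histograms can be illustrated with a simple sketch. The paper's actual classification algorithm is not reproduced here; the rules below (a bin counts as "elevated" if it exceeds the uniform expectation by more than 3 binomial standard deviations) are our own illustrative assumptions, using the same 40-bin resolution mentioned in the Discussion:

```python
import math

def classify_pvalue_histogram(pvals, n_bins=40):
    """Assign a p-value set to one of the five histogram classes.

    Simplified rule set (an assumption, not the authors' algorithm):
    a bin is 'elevated' if its count exceeds the uniform expectation
    by more than 3 binomial standard deviations.
    """
    m = len(pvals)
    counts = [0] * n_bins
    for p in pvals:
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    expected = m / n_bins
    thr = expected + 3 * math.sqrt(expected * (1 - 1 / n_bins))
    first_hi = counts[0] > thr           # spike near p = 0
    last_hi = counts[-1] > thr           # spike near p = 1
    middle_hi = any(c > thr for c in counts[1:-1])
    if first_hi and last_hi:
        return "bimodal"
    if first_hi and not middle_hi:
        return "anti-conservative"       # uniform plus a spike near zero
    if last_hi and not first_hi:
        return "conservative"            # excess of large p-values
    if not (first_hi or last_hi or middle_hi):
        return "uniform"
    return "other"
```

For example, a perfectly uniform set of 20,000 p-values is classified as "uniform," while the same set with 1,000 additional near-zero p-values becomes "anti-conservative."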

The proportion of true null hypotheses
To further enquire into DE analysis tool-driven bias, we estimated from user-submitted p-values the fraction of true null effects (the π0) for each HT-seq experiment. The π0 is a statistic calculated solely from the p-values, and it is routinely used as an intermediate quantity needed to fix FDR at the desired level [29]. The quality of FDR control depends on the quality of estimation of the π0, which in turn depends on the quality of the underlying p-values. Thus, the π0 can be seen as an estimate of the true proportion of true nulls or as a single-number summary of the p-value distribution, and in the latter capacity, it can be used as a quality check of the p-values used for FDR control of the experiment.

[Fig 2 legend: (A) Shaded areas denote 95% credible regions. N = 4,616. The data file is in S1 Data. (B) Association of p-value histogram type with DE analysis tool; data is restricted to 2018-2020 GEO submissions. Points denote best fit of the model [n | trials(total in de_tool) ~ class + de_tool + class:de_tool, binomial likelihood]. Thick and thin lines denote 66% and 95% credible intervals, respectively. N = 2,930. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/Class_year__year_detool_year.rds. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.2]
As non-anti-conservative sets of p-values (excepting the "uniform") already indicate low quality of both the π0 estimate and of the ensuing FDR control [30], we only calculated the π0 for datasets with anti-conservative and uniform p-value distributions (N = 1,188). Nevertheless, the π0-s show an extremely wide distribution, ranging from 0.02 to 0.999. Remarkably, 37% of the π0 values are smaller than 0.5, meaning that, according to the calculated p-values, in those experiments over half of the features are estimated to change their expression levels upon experimental treatment (Fig 3A). Conversely, only 23% of π0-s exceed 0.8, and 9.9% exceed 0.9. The peak of the π0 distribution is not near 1, as might be expected from experimental design considerations; instead, there is a broad peak between 0.5 and 0.8 (median and mean π0 are both 0.59). Depending on the DE analysis tool, the mean π0-s range over 20 percentage points, from about 0.45 to 0.65 (Fig 3B; see S6A Fig for all DE analysis tools). Using the whole dataset confirms the robustness of this analysis (S6B Fig).
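The π0 statistic is conventionally estimated from the upper tail of the p-value distribution, where true-null p-values (uniform by construction) dominate. A minimal sketch of a Storey-type estimator at a single tuning point λ (production implementations, such as the qvalue package, smooth estimates over a grid of λ values):

```python
def estimate_pi0(pvals, lam=0.5):
    """Storey-type estimate of the fraction of true null hypotheses.

    Under the null, p-values are uniform, so the density above `lam`
    estimates pi0; p-values from true effects concentrate near zero
    and contribute little to the upper tail.
    """
    m = len(pvals)
    tail = sum(1 for p in pvals if p > lam)
    return min(1.0, tail / (m * (1 - lam)))
```

For a well-behaved anti-conservative p-value set with 20% true effects, the estimator returns a value close to 0.8.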
In terms of experimental design, to get a low π0 an experiment would have to substantially change the expression of most genes under study. A significant source of such experiments would be comparisons of different cancer cell lines/tissues, where π0 = 0.4 could be considered a reasonable outcome [31]. We therefore compared the π0-s coming from GEO HT-seq submissions related to the search terms "neoplasms" or "cancer" (for the exact query string, please see Methods) with all other, non-cancer submissions. There is very little difference in the means and standard deviations of π0-s for cancer and non-cancer experiments (0.58 (0.22) and 0.59 (0.24), respectively) (Fig 3C). Filtering by studies mentioning transcription factors led to very similar results (Fig 3D), suggesting that the wide dispersion of π0-s is not caused by intentional experimental designs. Also, studies involving cancer and TFs resulted in anti-conservative or uniform p-value distributions with similar probabilities to non-cancer/non-TF studies (risk ratios with 95% CI are 0.95 (0.83; 1.07) and 0.99 (0.74; 1.28), respectively).

[Fig 3 legend (partial): (B) indicates an association of π0 with the DE analysis tool. Points denote best estimates for the mean π0; thick and thin lines denote 66% and 95% credible intervals, respectively. N = 1,188. The data file is in S4 Data. (C) Histogram of π0 values in GEO cancer studies compared to non-cancer studies. The data file is in S5 Data. (D) Histogram of π0 values in GEO transcription factor studies compared to non-TF studies. The data file is in S6 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_sample.rds. https://doi.org/10.1371/journal.pbio.3002007.g003]
In addition, controlling for time, taxon, or sequencing platform did not substantially change the association of DE analysis tools with the π0-s (S6C–S6F Fig). Recalculating the π0-s with a global FDR algorithm [32] did not lead to a substantially different π0 distribution, although there appears to be a slight directional shift: the mean π0 moves from 0.58 under the local FDR method to 0.54 under the global FDR method (S7 Fig). As there is a strong association of both the π0-s and the proportion of anti-conservative p-value histograms with the DE analysis tool (Figs 2 and 3), we further checked for, and failed to see, similar associations with variables from raw sequence metadata, such as the sequencing platform, library preparation strategies, library sequencing strategies, library selection strategies, and library layout (single or paired) (S8–S15 Fig). These results support the conjecture that the associations are specific to DE analysis tools.

The sample size distribution of DE HT-seq experiments indicates persistently low power
We assigned sample sizes to 2,393 GEO submissions (see Materials for details). Of these, 91% had sample sizes of 4 or less, 25% had an N of just 2, and 12% used a single sample from which to calculate p-values, thereby entirely ignoring biological variation (Fig 4A). Only 1% of experiments had sample sizes over 10.
Are the observed sample sizes sufficient to convey reasonable power to DE RNA-seq experiments? Simulations on idealized data with natural variation and effect sizes lead N = 2 experiments to 30% to 60% power and N = 3 experiments to 50% to 70% power in human/mouse experiments, where higher π0-s are associated with lower power (Fig 4B). The power of real-life experiments is likely to be even lower, due to the imperfect match between the data and the statistical model used to analyze it [34]. Thus, assuming that realistic levels of biological variation are captured in the samples, most DE RNA-seq experiments conducted in eukaryotic systems must be underpowered. This conclusion is consistent with the methodological literature (see Discussion).

[Fig 4 legend: power for human/mouse settings (shown as shades of blue coloring) and 2 different biological variation settings ("Gilad" corresponds to human liver samples and "Bottomly" to inbred mice; [33]). The data file is in S8 Data. https://doi.org/10.1371/journal.pbio.3002007.g004]
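How power scales with these sample sizes can be sketched with an idealized two-sample z-test with known variance. This is a deliberate simplification of the simulations above (real RNA-seq power analyses model count overdispersion), and the effect size and standard deviation below are assumed values chosen only for illustration:

```python
from statistics import NormalDist

def two_sample_power(delta, sigma, n, alpha=0.05):
    """Power of an idealized two-sample z-test with known variance.

    delta: true difference in group means (e.g., a log2 fold change),
    sigma: per-sample standard deviation, n: samples per group.
    A simplification of RNA-seq power, but the sqrt(n) scaling
    of the standard error is the same.
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    se = sigma * (2 / n) ** 0.5      # standard error of the difference
    shift = delta / se
    # probability of rejecting in either tail under the alternative
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)
```

With an assumed 2-fold change (delta = 1 on the log2 scale) and an assumed between-replicate SD of 0.7, this gives roughly 0.30 power at n = 2 and 0.42 at n = 3, broadly consistent with the simulation ranges quoted above.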

A low sample size lowers the probability of obtaining anti-conservative p-value distributions
There is a clear trend, shown in Fig 5A, whereby very small samples of 5 or fewer lead to progressively smaller fractions of anti-conservative p-value distributions. While the probability of obtaining an anti-conservative p-value distribution is only around 0.1 if the sample size is one, it increases to 0.3 for sample size 3 and further to about 0.5 for sample size 5. We could detect no further increase in this probability as sample size grew further (but note that the vast majority of experiments have sample sizes of 4 or less). As our simulations indicate that reasonably powered (power around 0.7 to 0.8) experiments start around sample sizes 4 to 5 for multicellular systems, this result suggests that the existing DE analysis algorithms may work poorly on data obtained from underpowered experiments.
Experiments that did not result in anti-conservative p-value distributions were less likely to have sample sizes larger than 3 and more likely to have sample sizes of 1 or 2 than the experiments that led to anti-conservative p-value distributions (Fig 5B). This effect was especially pronounced for experiments with conservative p-value distributions, of which 52% had sample sizes of 1 or 2, as compared with 26% for anti-conservative p-value distributions. Overall, the experiments resulting in anti-conservative p-value distributions have a slightly larger mean sample size (3.5 versus 2.7; p = 10^−9).

[Fig 5 legend: Points denote best estimates for the mean; thick and thin lines denote 66% and 95% credible intervals, respectively. The data file is in S10 Data. The model object related to panel A can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons__N.rds. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/n%20%7c%20trials%28nn%29__Class%20+%20Nb%20+%20Class:Nb.rds. https://doi.org/10.1371/journal.pbio.3002007.g005]

These results indicate a clear causal link between sample size and the technical quality of most current workflows of DE analysis. Thus, low power not only leads to missed discoveries and overestimated DE-s, as predicted by theory [35], but it also seems to be destructive to the algorithms that calculate these DE-s.

Lack of association of sample size with the proportion of true nulls
According to statistical theory, the power of an experiment depends linearly on the square root of the sample size. An underpowered experiment results in a smaller number of near-zero p-values, leading to overestimation of the π0 at small N-s, especially at N = 2, as confirmed by simulation (Fig 6A). Thus, reducing the power will ultimately convert an anti-conservative p-value distribution into a uniform one. However, we not only have very few uniform p-value distributions in our empirical data but also few near-one estimated π0-s that could be construed as manifestations of low power (Fig 3A). We looked into this further by plotting the π0 values versus sample size (Fig 6B). Instead of the expected decrease of the mean π0 with increasing sample size, we see a slight increase of π0-s at sample sizes >5, while the overall correlation is negligible (r = 0.06; 95% CI [0.007, 0.12]). Thus, there does not appear to be a sample size-dependent bias in π0-s, as estimated from the GEO-submitted p-value distributions. As the true power of most small-sample DE RNA-seq experiments is expected to be low, this result leads us to question further the statistical adequacy of the underlying p-values, of which the π0-s are but single-number summaries. These results show that statistical methods used in HT-seq DE analysis tend to produce highly suspect output even when the p-value distribution is anti-conservative.
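The expected inflation of estimated π0 under low power can be made concrete with a toy calculation. Modeling p-values from true effects as Beta(a, 1), where a closer to 1 stands for lower power (the Beta(a, 1) model and the parameter a are our assumptions, not quantities from the paper), the expectation of the Storey estimator at λ = 0.5 has a closed form:

```python
def expected_pi0_hat(pi0, a, lam=0.5):
    """Expected Storey pi0 estimate under a Beta(a, 1) toy model for
    p-values from true effects (smaller a = higher power).

    Null p-values land above lam with probability (1 - lam);
    alternative p-values do so with probability 1 - lam**a,
    inflating the estimate when power is low (a near 1).
    """
    tail_alt = 1 - lam ** a
    return (pi0 * (1 - lam) + (1 - pi0) * tail_alt) / (1 - lam)
```

With a true π0 of 0.5, a well-powered setting (a = 0.05) yields an expected estimate near 0.53, while a poorly powered one (a = 0.5) yields about 0.79. Small-N experiments should therefore drift toward high estimated π0-s, which is not what the GEO data show.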

Curing of p-value histograms by removing low-count features
We observed a slight reduction in the mean number of p-values per GEO experiment for anti-conservative and uniform histograms compared to other p-value histogram classes, suggesting that p-value sets with anti-conservative and uniform shapes are more likely to have been pre-filtered, or to have been pre-filtered more extensively (S1 Fig). Accordingly, we speculated that by further filtering out features with low counts, we could convert some of the untoward p-value histograms into anti-conservative or uniform types. Our goal was not to provide optimal interventions for individual datasets, which would require tailoring the filtering algorithm to the requirements of a particular experiment. We merely aim to provide proof of principle that a simple filtering approach can increase the proportion of anti-conservative p-value sets and/or reduce the dependence of results on the analysis platform. Therefore, we applied filtering to 3,426 p-value sets where we could identify gene expression values (see Methods for details on selecting filtering thresholds).

[Fig 6 legend: (A) Calculated π0-s (on the y-axis) from simulated data vs. given "true" proportions of DE features (on the x-axis). Sample sizes are indicated in color code. The dotted line shows the perfect correspondence between the given π0-s and the estimated π0. The data file is in S11 Data. (B) Dependence of mean π0 on the binned sample sizes with 95% CI. Robust linear model [pi0 ~ N], Student's likelihood. Points denote best estimates for the mean; thick and thin lines denote 66% and 95% credible intervals, respectively. The data file is in S12 Data. The model object related to panel B can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0%20~%20N.rds. https://doi.org/10.1371/journal.pbio.3002007.g006]
We found that, overall, we could increase the proportion of anti-conservative p-value histograms 2.4-fold, from 844 (24.6%) to 2,022 (59.0%), and the number of uniform histograms 2.5-fold, from 8 (0.23%) to 20 (0.6%) (Fig 7A). For all analysis platforms, most rescued p-value distributions came from the classes "bimodal" and "other," while almost no rescue was detected from conservative histograms (Fig 7B-7F). After removing low-count features, the proportion of anti-conservative p-value histograms increased for all analysis platforms, with the largest effects observed for cuffdiff and deseq, which presented the lowest pre-rescue fractions of anti-conservative p-value histograms (Fig 7G and 7H). Nevertheless, substantial differences between analysis platforms remain, indicating that the removal of low-count features, while generally beneficial, was insufficient to entirely remove the sources of bias originating from the analysis platform. Also, the π0-s calculated from the rescued anti-conservative p-value sets have very similar distributions to the π0-s from the pre-rescue anti-conservative p-value sets and, concomitantly, very similar dependence on the analysis platform (Fig 7I and 7J).
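The filtering step itself is simple. A minimal sketch (the mean-count threshold of 10 is an arbitrary illustrative choice, not the threshold-selection procedure described in the Methods):

```python
def filter_low_counts(counts, min_mean=10):
    """Drop features whose mean count across samples is below min_mean.

    counts: mapping of feature id -> list of per-sample read counts.
    Low-count features have noisy, highly discrete test statistics
    that can distort the p-value histogram; removing them before DE
    testing is the kind of intervention that 'rescued' histograms here.
    """
    return {feat: c for feat, c in counts.items()
            if sum(c) / len(c) >= min_mean}
```

Production tools use more elaborate rules (e.g., expression-level filters that account for library size and group structure), but the principle is the same.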

Discussion
In this work, we assess the quality of the differential expression analysis by HT-seq based on a large, unbiased NCBI GEO dataset. Our goal was to study real-world statistical inferences made by working scientists. Thus, we study how experimental design choices and analytic decisions of scientists affect the quality of their statistical inferences, with the understanding that in a field where each experiment encompasses ca. 20,000 parallel measurements of DE on average, the quality of statistical inference is highly relevant to the quality of the scientific inference. To the best of our knowledge, this is the first large-scale study to offer quantitative insight into the general quality of experimentation and data analysis of a large field of biomedical science.
We show that:

i. Overall, three-quarters of HT-seq DE experiments result in p-value distributions indicating that assumptions behind the DE tests have not been met. However, for many experiments, a simple exclusion of low-count features rescues the p-value distribution.
ii. Very few experiments result in uniform p-value distributions that indicate <100 DE features.
iii. The sample sizes of the vast majority of HT-seq DE experiments are inconsistent with reasonably high statistical power to detect true effects.
iv. Nevertheless, the distribution of π0-s, the fraction of true null hypotheses in an experiment, which is estimated solely from the p-value distributions of experiments presenting anti-conservative p-value distributions, peaks at around 0.5, as if in many experiments most genes were DE. Furthermore, the estimated π0-s do not correlate with the sample sizes of the experiments, indicating that the underlying p-value sets are problematic for controlling FDR.

[Fig 7 legend: The data files are in S13 Data and in S14 Data (for raw data). (H) Effect sizes, in percentage points, of low-count feature filtering on the proportion of anti-conservative p-value histograms. The data files are in S13 Data and in S14 Data (for raw data). (I) Posterior summaries of π0 values of p-value histograms in raw and filtered p-value sets. Filtered p-value data is the p-value from the beta model [pi0 ~ de_tool], N = 2,042. The data files are in S15 Data and in S16 Data (for raw data). (J) Effect sizes, in π0 units (percentage points), of low-count feature filtering on π0. The data files are in S15 Data and in S16 Data (for raw data).]

v. The proportion of anti-conservative p-value distributions and the values of π0-s are strongly associated with the DE analysis platform.
These results show that not only are most p-value sets statistically suspect (as shown by the preponderance of untoward p-value distributions), but that even the well-behaved anti-conservative p-value sets tend to lead to statistical inferences that are incompatible with the observed low sample sizes and, therefore, low power. Furthermore, the observed association of both the p-value distributional classes and the π0-s of anti-conservative p-value sets with the data analysis platform strongly indicates that DE analysis results in the literature are substantially biased by the analysis platform used.
From our meta-science study, we can conclude that the actual use of statistical tools in the HT-seq DE field is inconsistent with a high quality of statistical inferences. While our results clearly show the pervasiveness of the problem, they by themselves cannot answer the question: how much does it matter downstream for the quality of scientific inferences? However, recent methodological work on individual workflows allows us to provide an answer: it matters a lot. Namely, from simulation studies of existing workflows, it is now becoming clear that all popular p-value-based methods that can be used with small samples often fail at the one thing they were really created for: FDR control [30,34,36]. Our study sees the results of this widespread failure to control FDR in the high prevalence of low π0 estimates, whose lack of dependence on sample size defies the logic of statistical inference, thereby confirming that the poor FDR control observed in simulations is massively carried over into the results of actual HT-seq DE data analysis. Collectively, our study and the aforementioned simulation studies indicate that the field is largely built on analyzing low-power experiments, which are unlikely to identify actual effects, but still tend to overestimate the effect sizes of both true and false DE-s among statistically significant results [35], while presenting an unknowable number of false discoveries as statistically significant. It is hard to imagine a worse state of affairs, or that this would not substantially affect the quality of relevant scientific conclusions.
In the following, we will further discuss specific aspects of our work.

The meaning of the p-value distribution
[Fig 7 legend, continued: The model object related to filtered p-value sets in panel G can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/anticons_detool_filtered.rds. The model object related to filtered p-value sets in panel I can be downloaded from https://gin.g-node.org/tpall/geo-htseq-paper/src/v0.1/models/pi0_detool_full_data_filtered.rds. See S16 Fig.]

The first thing to notice about the p-value distributional classes is that there are extremely few uniform p-value distributions, suggesting relatively few actual effects. This unexpected result is made even more surprising by the low statistical power (<40%) of most real-world RNA-seq DE experiments, which should increase the likelihood of encountering uniform p-value distributions [37]. As a technical comment, it should be noted that the class we assign to a p-value histogram depends on an arbitrarily set bin size. Our use of 40 bins leads to histograms where an experiment with up to around 100 true effects out of 20,000 features could well produce a uniform histogram, because the lowermost bin (p < 0.025) is swamped with p-values emanating from true null effects (S17 Fig). If the p-values had been calculated correctly, then the lack of uniform p-value distributions would indicate that (i) there are almost no experiments submitted to GEO where the experimental treatment led to only a few DE genes and (ii) the actual power of GEO-submitted experiments is not low at all. We find both possibilities hard to believe. While there is a positive temporal trend for an increasing fraction of anti-conservative p-value sets, overall, a substantial majority of them fall into shapes indicating that the assumptions behind the statistical tests that produced these p-values have not been met. Expressly, p-value distributions that are not anti-conservative/uniform relinquish statistical control of the experiment over type I errors, on the presence of which later p-value adjustments are predicated [21].
Consequently, such p-value sets have effectively relinquished their statistical meaning and are thus problematic for scientific interpretation and further analysis, including FDR/q-value calculations. Specifically, conservative p-value histograms are thought to be caused by small samples and are indicative of reduced power and overly conservative control of FDR [30]. The conjecture of low power is supported by our data in that experiments with conservative p-value distributions are the most likely to have sample sizes of 1 or 2 (Fig 5B).
In contrast, anti-conservative p-value distributions, which have more small p-values than expected under the null and thus led to reduced π0 estimates in our analysis, tend to lead to loss of FDR control [30]. In general, even a small deviation from uniformity in the p-value distribution of true nulls is expected to throw off the overall FDR control, which, alas, is the main reason for using these methods in the statistical analysis of DE [30]. In fact, without FDR control, any statistical method that converts continuous effects into statistical significance through binary decisions can do more harm than good [11].
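The FDR adjustments at issue build directly on the assumption that null p-values are uniform. For reference, the standard Benjamini-Hochberg adjustment can be sketched as follows; its guarantee evaporates when the null p-value distribution deviates from uniformity:

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values.

    Each p-value is scaled by m/rank, and a running minimum is taken
    from the largest p-value downward to enforce monotonicity.
    The FDR guarantee assumes uniformly distributed null p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from largest to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, the p-value set [0.01, 0.02, 0.03, 0.5] adjusts to [0.04, 0.04, 0.04, 0.5].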
Currently, most DE analysis platforms use parametric tests accompanied by GLM regression. However, it has been recently shown for the often-used parametric tests that their FDR control can be exceedingly poor, while the nonparametric Wilcoxon rank sum test achieves better results [34]. Furthermore, it is becoming clear that data normalization can introduce bias into HT-seq DE analysis, which cannot be corrected by many of the currently widely used methods [38,39]. Indeed, widely used DE analysis tools use RNA-seq data preprocessing strategies, which are vulnerable to situations where a significant fraction of features change expression, and different samples exhibit differing total RNA concentrations [25,40].
In general, the analysis tools used for differential expression testing allow for a plethora of choices, including distributional models, data transformations, and basic analytic strategies, which can lead to different results through different trade-offs [41][42][43]. Our finding of a high proportion of unruly p-value distributions suggests that the wide variety of individual data preprocessing/DE testing workflows, in their actual use by working scientists, results in an overall poor outcome. Although this situation has been steadily improving over the last decade, at least for the limma, edgeR, and DESeq2 users, there remains substantial room for further improvements.

Possible causes for analysis platform-driven bias
What could cause the systematic differences between analysis platforms that we see in p-value distributions and π0 values? The significant points of divergence in the analytic pathway include raw sequence preprocessing, aligning sequences to a reference sequence, counts normalization, and differential expression testing [42,44]. Our inability to abolish analysis platform-driven bias by removal of low-count features suggests that the pre-filtering of data is not a major source of this bias. Although in principle any tool can be used in many workflows, different tools tend to offer their users different suggested workflows and different amounts of automation for data preprocessing and analysis. For example, Cuffdiff almost fully automates the analysis, including the removal of biases [45], and the DESeq2 workflow suggests some preprocessing steps to speed up computations but uses automation to remove biases from input data [46], whereas edgeR [47] and limma-voom require more interactive execution of separate preprocessing and analysis steps [48,49]. We speculate that the popularity of Cuffdiff and DESeq2 partly lies in their automation, as the user is largely relieved from decision-making. However, we found that cuffdiff is associated with the smallest proportion of anti-conservative p-value histograms, whereas limma and edgeR, with their more hands-on approach, are associated with the highest proportions of anti-conservative histograms. Interestingly, limma and edgeR use very different distributional models for DE testing, supporting the notion that it might be data preprocessing rather than the statistical test that has the most impact on the success of DE analysis [25].
However, limma and edgeR were not the top performers on the π0 metric, where the highest performance is associated with cuffdiff, whose ability to provide anti-conservative p-value distributions was the least impressive (note that the π0-s were calculated only from the experiments with well-behaved p-value distributions).

Power considerations
The prevalent sample sizes of 2 to 3 in published HT-seq DE experiments are incompatible with our simulations and the literature in terms of providing enough power to the experiment to reliably find most DE genes. For example, with the currently favored sample size of 3, for most genes in isogenic yeast, effect sizes of at least 4-fold seem to be required for successful analysis, and the overall minimal acceptable sample size can be from 4 to 6 [37,50]. In animals, a reasonable sample size is expected to be substantially larger, over 10 for most genes, which are not highly expressed, and for cancer samples, it seems to be well over 20 [37,43,[51][52][53]. This raises the possibility that nearly all experiments, which should have resulted in uniform distributions, were somehow shifted into other distributional classes. The low power to detect DE of most genes also means that excluding most low-count genes (which means most genes) should be encouraged because it would increase the power to detect the highly expressed genes, while DE of lower-expressed genes would have been missed anyway. Of course, when the scientific goal is to study the DE of genes that are not highly expressed, there is no real alternative to substantially increasing the sample size.

Estimated π 0 -s suggest experimental design problems
The DE analysis tool-specific means of π 0 values range from 0.5 (tool "unknown") to 0.7 (tool "cuffdiff"), showing that by this criterion in an average RNA-seq experiment about half of the cellular RNA-s are expected to change their expression levels ( Fig 3B). In principle, a low π 0 could reflect true differential expression as a causal response to experimental treatment, or it could be an artefact of suboptimal data analysis. To further investigate this, we separately analyzed 2 cases of studies related either to cancer or transcription factors, assuming that available p-value sets reflect the situation where DE is profiled in cancer cells/tissues or after transcription factor interrogation, respectively. Due to large-scale rearrangements in cancer genomes and the ability of many TFs to change the expression of many genes, these studies are expected to lead to lower π 0 -s, perhaps in the 0.4 range [31]. However, in both cases, we saw essentially unchanging π 0 distributions, which suggests that the low π 0 -s in our dataset are indicative of problematic analytic workflows. This conclusion is further strengthened by the observed lack of negative correlations between π 0 and sample size, which is a proxy of statistical power.
It should be added that, as most data normalization workflows assume that most genes are not DE (and that total RNA concentrations are stable across experimental conditions), any study that results in a low π0 should strive to explicitly address this fact in the experimental setup, data analysis, and interpretation of results. It has been argued that most genome-wide DE studies ever conducted, including by HT-seq, have used experimental designs that make it impossible to untangle such global effects, at least in a quantitatively accurate way [54,55]. The issue lies in the use of internal standards to normalize the counts for different RNAs, which leads to wrong interpretations if most genes change expression in 1 direction. To overcome this problem, one could use spike-in RNA standards and compositional normalization [40]. However, spike-in normalization requires great care to work correctly in such extreme cases [39,56], and it is rarely used outside single-cell RNA-seq [40]. Thus, it seems likely that many or most of the low-π0 experiments represent technical failures, most likely during data normalization.
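A π0 point estimate can be computed from any raw p-value set, commonly via the qvalue R package's smoother. As a minimal sketch (the tuning value λ = 0.5 and the function name are our own assumptions, not the paper's workflow), a Storey-type estimator looks like:

```python
import numpy as np

def estimate_pi0(pvals, lam=0.5):
    """Storey-type point estimate of pi0, the fraction of non-DE genes.

    Under the null, p-values are uniform, so the p-value density above
    `lam` estimates pi0: pi0 ~ #{p > lam} / ((1 - lam) * m).
    """
    pvals = np.asarray(pvals)
    pi0 = np.mean(pvals > lam) / (1.0 - lam)
    return min(pi0, 1.0)  # pi0 is a proportion; cap at 1

# A well-behaved null experiment (uniform p-values) gives pi0 close to 1;
# values near 0.5, as seen in many GEO submissions, would imply that half
# of all measured genes responded to the treatment.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=20_000)
pi0_null = estimate_pi0(null_p)
```

An experiment reporting π0 near 0.5 thus claims, through this lens, that half of all measured genes changed expression, which is exactly the situation in which internal-standard normalization breaks down.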

Considerations for improvement of best practices
Currently, at least 29 different DE RNA-seq analysis platforms exist, which, among other differences, use 12 different distributional models for DE testing [50], while it is becoming apparent that all popular distributional models have trouble achieving the promised FDR levels [34]. At a minimum, a winnowing of recommended analytic choices is needed. Nevertheless, the design and analysis of a specific DE RNA-seq experiment would have to be responsive to several factors, including (i) the study organism and its genetic background, pertaining to biological variation and thus sample size; (ii) the expected π0, pertaining to experimental design, like the use of spike-in probes, and to data analysis; (iii) the number of genes of interest, pertaining to sample size through the multiple testing problem; (iv) the expected expression levels and biological variation of genes of interest, and whether DE of gene isoforms is of interest, pertaining to sample size and to sequencing depth; and (v) the structure of the experiment, pertaining to analysis via GLM (is it multi-group, multi-center, multi-experimenter?). As no analytic workflow has been shown to systematically outperform the others [25,[41][42][43], foolproof best practices are still out of reach.
Another option is the class of classic nonparametric tests, of which the Wilcoxon rank-sum test is the most popular. These tests are free of distributional assumptions and control FDR well, but they need sample sizes of at least 10 to achieve good power [34]. We note that such sample sizes are scarce in the HT-seq DE analysis field. Recently, p-value-free approaches to FDR control that do not depend on distributional models have been proposed that look promising even in small samples [36,57]. These methods would, of course, transcend any problems caused by p-values (as well as the diagnostic value of the p-value distribution), except for the more general malady of effect size overestimation at low power, which comes along with any method that dichotomizes continuous data into "discoveries" and the rest [11].
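To illustrate why such tests are distribution-free (this is an illustration, not any published pipeline), a rank-sum p-value for one gene can be computed from ranks alone. The normal approximation below assumes continuous, tie-free normalized expression values; exact small-sample tables and tie corrections are omitted:

```python
import numpy as np
from math import erf, sqrt

def ranksum_pvalue(a, b):
    """Two-sided Wilcoxon rank-sum p-value via the normal approximation.

    Only the ranks of the pooled values enter the statistic, so no count
    model (negative binomial or otherwise) is assumed.
    """
    n1, n2 = len(a), len(b)
    pooled = np.concatenate([a, b])
    ranks = pooled.argsort().argsort() + 1        # ranks 1..(n1 + n2)
    w = ranks[:n1].sum()                          # rank sum of group a
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))

# Two groups of 10 replicates each: the regime where rank-based tests
# start to reach reasonable power [34].
rng = np.random.default_rng(1)
p_null = ranksum_pvalue(rng.normal(5, 1, 10), rng.normal(5, 1, 10))
p_de = ranksum_pvalue(rng.normal(5, 1, 10), rng.normal(8, 1, 10))
```

With the prevalent sample sizes of 2 to 3, however, the rank-sum statistic can take too few distinct values for small p-values to be attainable at all, which is why such tests only become useful from roughly 10 replicates per group.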
While our results leave no doubt about the pervasiveness of problems in the HT-seq DE field, being meta-scientific by nature, they can offer relatively little in terms of suggestions for specific improvements to the various workflows used in the field. Although examining p-value histograms and the π0-s derived from them, and more stringently excluding low-count features, should make their way into every HT-seq DE analysis, we have no reason to think that these steps alone will cure the field of its ailments. Also, while removing low-expressed genes before model fitting in DE analysis can have a substantial positive effect on the sensitivity of detection of differentially expressed genes [58,59], this comes with the cost of excluding about half of the genes from analysis as lowly expressed [60]. While there is no single best way to exclude features from analysis, an adaptive filtering procedure has recently been proposed [61]. In deciding how many low-count features to exclude in a given study, both scientific and statistical considerations should play a part. Namely, quantification of differential expression of low-expressed genes appears to be unreliable with existing DE analysis tools [62], and the sample size needed for accurate measurement of DE has a strong inverse relationship with the expression level of the gene [43].
Increasing power by increasing sample sizes, while designing the experiments to capture the full extent of relevant biological variation in those samples, would be another obvious, if expensive, recommendation. In terms of DE analysis, a significant danger seems to be overly reducing the within-data variation during data preprocessing, so that biological variation present in the raw data is lost during the analysis.
In conclusion, we do not think there is a single apparent fix to the problems afflicting a large field with quite heterogeneous scientific questions and corresponding experimental designs, and it may well be that the methodology that works for HT-seq DE still lies in the future. It is possible that fixing the RNA-seq field requires, as a first step, the development of cheaper ways of conducting experiments, which would make well-powered experiments practically feasible, which in turn could enable the development of analytic workflows that work.

NCBI GEO database query and supplementary files
NCBI GEO database queries were performed using the Bio.Entrez Python package, by sending requests to the NCBI Entrez public API. The exact query string to retrieve GEO HT-seq datasets was "expression profiling by high-throughput sequencing[DataSet Type] AND ("2000-01-01" [PDAT]: "2020-12-31" [PDAT])." Accession numbers of cancer-related datasets were identified by amending the original query string with "AND ("neoplasms" [MeSH Terms] OR cancer[All Fields])." FTP links from GEO dataset document summaries were used to download supplementary file names. Supplementary file names were filtered for downloading, based on file extensions, to keep file names with "tab," "xlsx," "diff," "tsv," "xls," "csv," "txt," "rtf," and "tar" file extensions. We dropped the file names where we did not expect to find p-values using the regular expression "series_matrix\.txt\.gz$|filelist\.txt
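The file name filtering step can be sketched as follows. The optional ".gz" suffix and the helper name are our assumptions (GEO supplementary files are commonly gzipped), and the drop pattern reproduces only the regular expression as far as it is quoted above:

```python
import re

# Extensions under which supplementary tables were kept for download;
# the optional ".gz" suffix is an assumption.
KEEP_EXT = re.compile(
    r"\.(tab|xlsx|diff|tsv|xls|csv|txt|rtf|tar)(\.gz)?$", re.IGNORECASE
)
# Files where p-values were not expected (series matrices, file lists);
# this reproduces the truncated regular expression quoted in the text.
DROP = re.compile(r"series_matrix\.txt\.gz$|filelist\.txt")

def keep_supplementary_file(name):
    """Return True if a GEO supplementary file name passes both filters."""
    return bool(KEEP_EXT.search(name)) and not DROP.search(name)
```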

NCBI supplementary file processing
Downloaded files were imported using the Python pandas package and searched for unadjusted p-value sets. Unadjusted p-value sets and the summarized expression levels of the associated genomic features were identified using column names. P-value columns in the imported tables were identified by the regular expression "p[^a-zA-Z]{0,4}val"; from these, adjusted p-value sets were identified using the regular expression "adj|fdr|corr|thresh" and omitted from further analysis. We algorithmically tested the quality of the identified p-value sets and removed from further analysis apparently truncated or right-skewed sets, p-value sets that were not in the 0 to 1 range, and p-value sets that consisted entirely of NaN values. Columns with expression levels of genomic features were identified using the following regular expressions: "basemean," "value," "fpkm," "logcpm," "rpkm," "aveexpr." Where expression level data were present, raw p-values were further filtered to remove low-expression features using the following thresholds: basemean = 10, logcpm = 1, rpkm = 1, fpkm = 1, aveexpr = 3.32. Basemean is the mean of library-size-normalized counts over all samples, logcpm is the mean log2 counts per million, rpkm/fpkm is reads/fragments per kilobase of transcript length per million reads, and aveexpr is the average expression across all samples in log2 CPM units, where CPM is counts per million. Row means were calculated when there were multiple expression level columns (e.g., one for each contrast or sample) in a table. Filtered p-value sets were stored and analyzed separately.
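The column-matching logic amounts to the following sketch; case-insensitive matching and the function name are our assumptions (the text gives the patterns in lowercase, while column names in the wild mix cases, e.g., "P.Value" vs "pvalue"):

```python
import re

# Patterns quoted in the text: first find p-value-like columns, then
# drop those that look adjusted (FDR/corrected/thresholded).
PVAL = re.compile(r"p[^a-zA-Z]{0,4}val", re.IGNORECASE)
ADJUSTED = re.compile(r"adj|fdr|corr|thresh", re.IGNORECASE)

def raw_pvalue_columns(columns):
    """Names of columns holding unadjusted p-values."""
    candidates = [c for c in columns if PVAL.search(c)]
    return [c for c in candidates if not ADJUSTED.search(c)]
```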

Classification of p-value histograms
Raw p-value sets were classified based on their histogram shape. The histogram shape was determined based on the presence and location of peaks. P-value histogram peaks (bins) were detected using the quality control (QC) threshold described in [27]: a Bonferroni-corrected alpha-level quantile of the cumulative distribution function of the binomial distribution with size m and probability p. Histograms in which no bin exceeded the QC threshold were classified as "uniform." Histograms in which the bins over the QC threshold formed a contiguous run starting from the left or right boundary and spanning at most 1/3 of the 0 to 1 range were classified as "anti-conservative" or "conservative," respectively. Histograms with peaks or bumps in the middle, or with non-contiguous left- or right-side peaks, were classified as "other." Finally, histograms with both left- and right-side peaks were classified as "bimodal."
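The QC threshold and a simplified version of these classification rules can be sketched as follows. The choices of 40 bins and alpha = 0.05 are assumptions ([27] specifies the exact parameters), and discrimination between "bimodal" and "other" is omitted for brevity:

```python
from math import exp, lgamma, log

def binom_log_pmf(k, n, p):
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def qc_threshold(m, n_bins=40, alpha=0.05):
    """Smallest bin count that a uniform histogram of m p-values exceeds
    with probability < alpha / n_bins (Bonferroni over bins)."""
    q, p = 1.0 - alpha / n_bins, 1.0 / n_bins
    cdf, k = 0.0, 0
    while cdf < q:                      # walk the binomial CDF up to q
        cdf += exp(binom_log_pmf(k, m, p))
        k += 1
    return k - 1

def classify(counts, threshold):
    """Simplified classification of a p-value histogram (bin counts)."""
    n = len(counts)
    peaks = [i for i, c in enumerate(counts) if c > threshold]
    if not peaks:
        return "uniform"
    if peaks == list(range(len(peaks))) and peaks[-1] < n / 3:
        return "anti-conservative"      # contiguous run from the left edge
    if peaks == list(range(n - len(peaks), n)) and peaks[0] >= 2 * n / 3:
        return "conservative"           # contiguous run from the right edge
    return "other/bimodal"
```

For 20,000 p-values in 40 bins, the expected bin count under uniformity is 500, and the threshold lands a few binomial standard deviations above that, so single-bin noise is not mistaken for a peak.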

Sample size determination
Sample sizes were algorithmically determined for tables containing a single column of p-values by dividing the number of samples by the number of p-value sets + 1. Only balanced designs were retained, and all sample sizes of 4 or more were manually verified.
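As a sketch, the rule amounts to the following (the helper name is hypothetical, and treating non-divisible sample counts as unbalanced designs is our reading of the text):

```python
def infer_group_size(n_sample_columns, n_pvalue_sets):
    """Per-group sample size for a balanced design: samples are split
    evenly among (number of p-value sets + 1) groups; designs that do
    not divide evenly are discarded by returning None."""
    n_groups = n_pvalue_sets + 1
    if n_sample_columns % n_groups != 0:
        return None
    return n_sample_columns // n_groups
```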

Modeling
Bayesian modeling was done using the R libraries rstan vers. 2.21.3 [64] and brms vers. 2.16.3 [65]. Models were specified using the extended R lme4 [66] formula syntax implemented in the brms package. We used weak priors and fit each model with a minimum of 2,000 iterations and 3 chains. When suggested by brms, the Stan NUTS control parameter adapt_delta was increased to 0.95-0.99 and max_treedepth to 12-15.

RNA-seq simulation
RNA-seq experiment simulation was done with the polyester R package [67], and differential expression was assessed using the DESeq2 R package [46] with default settings. Code and workflow used to run and analyze the RNA-seq simulations are deposited in Zenodo with doi: 10.5281/zenodo.4463804 (https://doi.org/10.5281/zenodo.4463804). RNA-seq simulations to determine the relationship between sample size, π0, and power were done using the PROPER package [33].
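The simulations themselves were run in R (polyester for counts, DESeq2 for testing). Purely to illustrate the simulated design grid, a minimal gamma-Poisson (i.e., negative binomial) count generator might look like the following sketch, where the dispersion value, the expression distribution, and all function names are arbitrary assumptions:

```python
import numpy as np

def simulate_counts(n_genes=20_000, n_reps=3, n_effects=100, fold=2.0, seed=0):
    """Minimal stand-in for a polyester-style simulation: negative
    binomial counts for two groups, with `n_effects` genes changed
    `fold`-fold (up or down) in group 2."""
    rng = np.random.default_rng(seed)
    base = rng.lognormal(mean=4, sigma=1, size=n_genes)   # mean expression
    fc = np.ones(n_genes)
    de = rng.choice(n_genes, size=n_effects, replace=False)
    fc[de] = rng.choice([fold, 1 / fold], size=n_effects)
    dispersion = 0.1                                      # NB size = 1/disp

    def nb(mean):
        # Negative binomial draws via the gamma-Poisson mixture.
        shape = 1.0 / dispersion
        lam = rng.gamma(shape, mean[:, None] / shape, size=(n_genes, n_reps))
        return rng.poisson(lam)

    return nb(base), nb(base * fc), de
```

Varying `n_reps` over 3, 6, and 10 and `n_effects` over 100 to 800 reproduces the design grid described for S17 Fig, though the count model here is a simplification of polyester's read-level simulation.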

S16 Fig. Removal of low-count features results in an increasing proportion of anti-conservative p-value histograms. (A) Anti-conservative p-value histogram proportions in raw and filtered p-value sets for DE analysis programs. Raw p-value data are the same as in S5A Fig; filtered p-value data are from a simple Bernoulli model [anticons~de_tool], N = 3,426. The data files are in S13 Data and in S14 Data (for raw data). (B) Effect size of low-count feature filtering on the proportion of anti-conservative p-values. The data files are in S13 Data and in S14 Data (for raw data). (C) π0 estimates for raw and filtered p-value sets. Raw p-value data are the same as in S6A Fig; filtered p-value data are from the beta model [pi0~de_tool], N = 2,042. The data files are in S15 Data and in S16 Data (for raw data). (D) Effect size of low-count feature filtering on π0. The data files are in S15 Data and in S16 Data (for raw data). Points denote model best fit. Thick and thin lines denote 66% and 95% CIs, respectively. (TIFF)

S17 Fig. Simulated RNA-seq data show that histograms from p-value sets with around 100 true effects out of 20,000 features can be classified as "uniform". RNA-seq data were simulated with the polyester R package on 20,000 transcripts from the human transcriptome, using a grid of 3, 6, and 10 replicates and 100, 200, 400, and 800 effects for 2 groups. Fold changes were set to 0.5 and 2. Differential expression was assessed using the DESeq2 R package with default settings and the group 1 versus group 2 contrast. In facet labels, Effects denotes the number of true effects and N the number of replicates. The red line denotes the QC threshold used for dividing p-value histograms into discrete classes. Code and workflow used to run these simulations are available on GitHub: https://github.com/rstats-tartu/simulate-rnaseq. Raw data of the figure are available on Zenodo (https://zenodo.org) with doi: 10