Supplementary Materials for Evaluation of Bias-variance Trade-oﬀ for Commonly Used Post-summarizing Normalization Procedures in Large-Scale Gene Expression Studies

Normalization procedures are widely used in high-throughput genomic data analyses to remove various technological noise and variations. They are known to have profound impact to the subsequent gene differential expression analysis. Although there has been some research in evaluating different normalization procedures, few attempts have been made to systematically evaluate the gene detection performances of normalization procedures from the bias-variance trade-off point of view, especially with strong gene differentiation effects and large sample size. In this paper, we conduct a thorough study to evaluate the effects of normalization procedures combined with several commonly used statistical tests and MTPs under different configurations of effect size and sample size. We conduct theoretical evaluation based on a random effect model, as well as simulation and biological data analyses to verify the results. Based on our findings, we provide some practical guidance for selecting a suitable normalization procedure under different scenarios.


The N -test
We choose a multivariate nonparametric N -distance with the Euclidean kernel as a measure of the distance between two multivariate probability distributions. Using the same notation as in the main text, the sample N -distance across conditions A and B for gene i is defined as follows: where L(x, y) = |x − y| is the kernel defined by the Euclidean distance. We apply the following algorithm to calculate the permutation-based p-values.
1. Randomly shuffle the arrays in two different conditions, then split them into two groups of equal size.
2. Compute the N -statistic for each gene.
3. Repeat the above steps for K = 100, 000 times, record the permutation based N -statistics as N ik , i = 1, . . . , m, k = 1, . . . , K. They can be used to construct the permutation based null distribution for each index i.
4. Compute N i , the N -statistic for each gene without random shuffles.
5. Obtain the permutation based p-value, p i , by comparing N i with the null distribution constructed from N ik . Specifically, p i is defined to be #(N ik N i ) K , the proportion of N ik which is greater than or equal to N i .

Analytical Evaluation of Different Normalization Methods Based on the Mixed Effect Model
Based on the assumptions stated in Equation (6) (in Section "The Bias-Variance Trade-off of Normalization Methods") of the main text, we have

Global normalization
According to the definition of global normalization (Equation (1) in the main text), the normalized gene expressions are By comparing Equation (2) with Equation (1), it is clear that the global normalization introduces a small bias −Eȳ c ·j = − m + 1 e + +m − 1 e − m . More specifically, the expected mean group difference for global normalized data are The (4)

Quantile normalization
We investigate the bias effects through the following two aspects.
1. The rank skewing effect. Over(under)-expressed genes tend to have very high (low) ranks. It means that the NDEGs are much more likely to take only medium ranks. When the effect size is large, the DEGs in group A effectively take up all the top and bottom ranks, so the NDEGs in group A can only compete for ranks between m − 1 + 1 and m − m + 1 . For simplicity, we assume all NDEGs in group A have equal chances to take ranks from m − 1 + 1 to m − m + 1 . The expectation of y * A i· , i ∈ G 0 , is r),· ). δ 1 represents the mean expectation of the top m + 1 ordered expression levels and δ 2 represents the mean expectation of the bottom m − 1 ordered expression levels, provided that there is no differentially expressed gene. Clearly δ 1 and δ 2 do not depend on e + and e − . We assume all genes in group B have equal chances to take ranks from 1 to m. For i = 1, . . . , m, The difference between these two expectations explains the spurious effect introduced to NDEGs by the quantile normalization. A similar argument also explains the bias introduced by the rank normalization, which will be discussed later.
2. The averaging effect. Denote the reference quantile array constructed from one group by q c , c = A, B, we have In other words, the reference array q is computed by averaging both DEGs and NDEGs over arrays in two phenotypic groups, so the m + 1 over-expressed genes and m − 1 under-expressed genes in group A are "mixed up" with NDEGs of the same rank from group B. In fact, assuming that the m + 1 up-regulated genes almost always take the top m + 1 ranks with equal chances and the m − 1 down-regulated genes almost always take the bottom m − 1 ranks with equal chances, we can compute that for i ∈ G + 1 , As mentioned before, we focus on the case in which the up-regulated (down-regulated) genes almost always take the top (bottom) ranks. Recall that δ 1 (δ 2 ) represents the mean expression of the top m + 1 (bottom m − 1 ) genes, given that there is no DEG. In order for up-regulated (down-regulated) genes to almost always take up the top m + 1 (bottom m − 1 ) ranks, we must have Thus |E(y * A i· )| is smaller than the original effect size.
Based on the above calculations, the expected group differences are

Rank normalization
The number of genes (m) is usually very large in a typical microarray study. If the effect size is large such that the over-expressed genes always take up the top m + 1 ranks and the under-expressed genes always take up the bottom m − 1 ranks in group A, y * c ij approximately has the following uniform distribution: Here we assume that the genes take the specified ranks with equal chances for simplicity.
The expected group expression differences are 2.4 δ-sequence method The variance reduction effect of the δ-sequence method comes from the gene pairing and subtraction. From Equation (5) in the main text, we have Thanks to sorting by sample variance, we have σ

Results of simulation studies
From Tables S1, S3 and S5, we see that gene selection strategies based on both t-test and N -test almost always have comparable or better statistical power than the Wilcoxon rank-sum test for SIMU1, SIMU2, and SIMU3. The comparison between the t-test and N -test are not that clear cut, and the pattern seems to be: 1. The t-test has comparable or better statistical power than the N -test with NONE, GLOBAL, QUANT and DELTA. 2. The N -test outperforms t-test with RANK. It is well known that the Student's t-test is optimal when the inputs (normalized data) are normally distributed. In this study, SIMU1, SIMU2, and SIMU3 are all simulated based on Gaussian models. Furthermore, both GLOBAL and DELTA are linear transformations of data hence they preserve normality. QUANT transforms data by using quantiles of mixture normal distribution, so it also preserves normality to a certain extent. Hence it is no coincidence that the t-test attains best power in these situations. On the other hand, rank normalized data are highly non-normal. The N -test outperforms the t-test in this situation because it is a nonparametric test which is sensitive to a large class of distributional differences.
For SIMU-BIO S7 we observe some interesting differences. When applied to the normalized data, the N -test (or even the Wilcoxon rank-sum test) outperforms the t-test most of the time. The N -test also outperforms the t-test with unbalanced differential structure and a large sample size (n = 79) when no normalization is applied. This suggests that the original and normalized gene expressions in the biological data may not be normally distributed.
As observed in the simulation studies, gene selection strategies with normalization procedures detect more DEGs than those without normalization. When BH is applied in HYPERDIP vs. TEL, strategies with GLOBAL detect more positives than those with other normalization procedures. However, the strategies with QUANTILE and RANK detect more positives than those with GLOBAL when BONF is applied or the comparison is between TALL and TEL. This observation suggests that the technical noise may not be purely additive and is consistent with what we observe in SIMU-BIO. Among four normalization procedures, DELTA is the most conservative one in terms of the number of positives. Based on our simulation results, it is reasonable to believe that DELTA has relatively better control of type I errors.
By and large, the gene selection strategies based on N -statistics detect more positives than those based on the t-test and the Wilcoxon rank-sum test. More often than not, even the Wilcoxon rank-sum test detects more DEGs than the t-test. Just like in the SIMU-BIO study, this suggests that the expressions of biological data may not be normally distributed. Since tests based on N -statistic and Wilcoxon rank-sum statistic are both nonparametric, if the normality of data is in question, N -statistic can be used in place of