A Zipf-plot based normalization method for high-throughput RNA-seq data

Normalization is crucial in RNA-seq data analyses. Due to the existence of excessive zeros and a large number of small measures, it is challenging to find reliable linear rescaling normalization parameters. We propose a Zipf plot based normalization method (ZN) assuming that all gene profiles have similar upper tail behaviors in their expression distributions. The new normalization method uses global information of all genes in the same profile without gene-level expression alteration. It doesn’t require the majority of genes to be not differentially expressed (DE), and can be applied to data where the majority of genes are weakly or not expressed. Two normalization schemes are implemented with ZN: a linear rescaling scheme and a non-linear transformation scheme. The linear rescaling scheme can be applied alone or together with the non-linear normalization scheme. The performance of ZN is benchmarked against five popular linear normalization methods for RNA-seq data. Results show that the linear rescaling normalization scheme by itself works well and is robust. The non-linear normalization scheme can further improve the normalization outcomes and is optional if the Zipf plots show parallel patterns.


Introduction
Errors are inevitable and exist in many analytical platforms, including microarray, qRT-PCR, and next generation sequencing (NGS) technologies. Errors can be introduced in deep NGS data from different sources including sample handling, library preparation, sequencing, and many others. Substitution error is one type of sequencing error. It plays an important role in low-frequency genetic variant detection and can be corrected experimentally and computationally [1]. NGS data often come with measurement errors. For example, gene expressions can be length-biased. Measures of weakly expressed genes are often truncated to zeros due to sequencing depth or detection limits with the current technologies. When done appropriately, normalization can remove the unwanted artificial variations and hence improve the downstreaming data analyses such as differentially-expressed gene detection. Meanwhile, the normalization procedure by itself can introduce new errors to the analyses. Sometimes, it removes the true variations by causes of interest. With the plummet of sequencing costs in recent years, massive RNA-seq gene expression data have become publicly available. As a result, it becomes feasible to make good inferences of the weakly expressed genes via bulk RNA-seq data analysis. However, NGS data normalization is challenging and remains a major barrier to pooling data from different sources (by different laboratories, from different tissues, using different sequencing technologies and others). In general, the normalization methods for gene expression data can be classified into two categories: linear normalization methods and non-linear normalization methods. For linear normalization methods, a profile is rescaled by diving the expressions in the same profile by a positive normalizing parameter so expressions from different profiles are aligned to the same level for direct comparisons. Popular linear normalization methods include but not limited to the following five methods: • Total count method (TC): TC assumes that all profiles have the same number of reads. This method is simple and easy to implement. However, the total count of a profile can be greatly impacted by the extremely large measures and the performance of TC can be negatively affected by outliers [2,3].
• Median normalization method (MED): MED assumes that the gene expression distributions of all profiles have the same center and the median expression of a profile is used as its normalizing parameter [4,5]. MED is easy to implement and works well when the majority of genes are strongly expressed. However, In cancer related genome data analysis, the common median assumption is very likely to fail. In whole genome sequencing, a lot of genes are not expressed and many genes are weakly expressed. When the next generation sequencing technologies are used, the not expressed genes and the genes expressed below the detectable levels (due to the limits of technologies or experimental settings) have zero measures. MED is not applicable for the zero-inflated gene profiles. MED has been implemented within the DESeq Bioconductor package.
• Upper quartile method (UQ): Similar to MED, UQ assumes that the expression distributions of the two profiles to be normalized have the same upper quartile. UQ has been used as a remedy for gene profiles with excessive zeros or weakly expressed genes. However, UQ is not applicable if the proportions of zero counts are higher than 75%. In microarray or RNAseq gene expression data, it is not uncommon that the vast majority of genes are weakly or not expressed. When the proportion of weakly or not expressed genes is below but close to 75%, the performance of UQ can be poor as the sample upper quartiles can be small and the variation can be dominated by the measurement errors. UQ has been implemented within the edgeR Bioconductor package.
• Trimmed mean of M-values method (TMM): TMM assumes that the majority of genes are not DE across all profiles. It computes the fold changes (ratios of expressions) of genes in a pair of profiles and excludes the extremely large or small fold changes. The mean fold changes of the remaining genes in a profile is computed as its normalization parameter [4,6,7]. TMM works well when the number of genes are large and the majority of genes are strongly expressed. However, when the amount of weakly or not expressed genes is large, the fold changes can be greatly impacted by the measurement errors and hence the performance of TMM will be negatively affected. Also, such an approach can be problematic when only a subset of selected genes are profiled, as there is no guarantee that the majority of the genes are not DE. The performance of TMM also depends on the percentage of values trimmed at the two ends. TMM has been implemented within the edgeR Bioconductor package.
• Relative log expression method (RLE): RLE is similar to TMM. It also assumes that the majority of genes are not DE and the normalizing parameters are determined by using the median fold changes instead of the trimmed means [8][9][10]. RLE is less sensitive to extreme gene expressions than TMM. However, its performance may not be good when there are excessive weakly or not expressed genes in the profiles. RLE has been implemented in the DESeq and DESeq2 Bioconductor packages.
The non-linear normalization methods are usually based on stronger assumptions and are used for special purposes. For example, reads per kilo-base per million mapped reads (RPKM), fragments per kilo-base per million mapped reads (FPKM), and transcripts per million (TPM) methods are commonly used to correct the length biases for genes of different sizes. The normalization steps of these three methods are similar to TC except an extra adjustment step to correct the biases under an assumption that the expression measures are proportional to the lengths of the genes. The quantile normalization (QN) is another widely used non-linear method which assumes that all profiles follow similar distributions and allows gene-level expression modifications so that all profiles have the same distribution after normalization [11][12][13][14].
In this study, we propose a novel normalization method designated for RNA-seq data with excessive zeros and a large number of small counts (� 10). With the new normalization method, we perform linear and/or non-linear normalization so that the normalized profiles have similar upper tail behaviors as revealed in the Zipf plots.

RNA-seq data for lymphoblastoid cell lines
As part of the International HapMap project, a total of 52,580 RNAs were sequenced from lymphoblastoid cell lines (LCLs) derived from 69 Nigerian individuals generated using Illumina Genome Analyzer II (Homo sapiens) in a study by Pickrell and et al [15]. The reads are either 35 or 46 base pairs (bp) and are mapped using MAQ v0.6.8. The datasets are counts based without normalization, and are publicly available to download on Gene Expression Omnibus (GEO) by NIH NCBI (accession number "GSE19480"). Table 1 shows some summaries of the gene expression measures. The smallest percentage of zero measures among the 129 gene expression profiles is 83.2% and the largest is 86.37%.

PLOS ONE
That says, if a positive quantile needs to be used as the normalizing parameter to linearly rescale the expressions in a profile, the quantile level cannot be lower than 86.37%. Hence MED and UQ won't work for this data without special handling. In addition, the average percentage of counts with measures less than 10 across all profiles is 89.46% with standard deviation 0.92%. If we simply choose a high level quantile, say the quantile q α with level α = 0.90 or lower, as the normalizing parameter, many profiles may have the same quantile values of q α due to the existence of the large amounts of tied small counts. As a result, either no normalization will be performed, or small adjustments will be applied to correct the artificial variations in these profiles. In addition, when the normalization parameter is small, the measurement errors or other types of errors can play important roles in the estimated normalizing parameters and hence bias the normalized data. On the contrary, if we choose very high level quantiles as normalizing parameters, the variances of the estimated normalization parameters tend to be large.

The Zipf plot for NGS gene expression data
Plots (a)  . ., f m be the corresponding frequencies. We assume that the actual gene expression level of a gene is no smaller than the observed count and compute the rank of z i as follows: The plot of the logarithms of ranks versus the logarithms of z i 's is the so-called Zipf plot. Plot (c) of Fig 1 shows the Zipf plots of all LCL profiles. Due to the Zipf plots in plot (c) being densely packed, the profiles can hardly be distinguished from each other. However, the few isolated Zipf plots at the lower part demonstrate obvious patterns of parallel curves. The Zipf plot has been widely used to check whether a distribution follows the power law. For any x min > 0, if X � x min follows a power law distribution, the density function and distribution function of X are where κ > 1 is the parameter of the power law distribution. In the Zipf plot, rank r i can be well approximated by r i � n[1 − F(z i )] when n is sufficiently large, which is often the case for highthroughput RNA-seq data. Now we have For a power law distribution, we expect its Zipf plot to demonstrate a straight line pattern with slope 1 − κ < 0. Plot (c) reveals that expressions of the highly expressed genes in the LCL profiles approximately follow power law distributions. A modified version of the Zipf plot is which only results in a downward shift of log(n) in the y-axis.

Normalizing NGS gene expression data
For illustration purposes, we choose two profiles whose Zipf plots are far apart. Fig 2 shows the Zipf plots of profile 47 (solid curve) and profile 107 (dashed curve). For convenience, we reuse the notation X and denote a pair of profiles to be normalized as X and Y, respectively. Let . . . ; r x m x g be the corresponding ranks. Similarly, we denote the distinct measures and ranks of profile Y as z y ¼ fz y 1 ; z y 2 ; . . . ; z y m y g and r y ¼ fr y 1 ; r y 2 ; . . . ; r y m y g, respectively. When both profiles follow power law

PLOS ONE
distributions for X � x min and Y � y min , we have If the highly expressed genes in both profiles follow the same or similar power law distributions, the right ends of the two Zipf plots are expected to show straight line patterns, and the slopes of the fitted least squares lines should be close or the same. In Fig 2, the slope of the fitted least squares line of profile X (solid line) is very close to that of profile Y (dashed line). Without loss of generality, let X be the reference profile and Y be the profile to be normalized. We normalize profile Y so that the two Zipf plots stay as close as possible after normalization. Graphically, we keep the Zipf plot of profile X unchanged and shift the Zipf plot of profile Y in the x-axis then rotate it, which can be implemented using the following two normalization schemes:

PLOS ONE
• Rescaling: shifting in the x-axis is equivalent to linear rescaling. Let σ be the linear normalizing parameter. We have logY 0 = log(Y/σ) = logY − logσ. Here logσ is the distance that the Zipf plot of Y needs to move towards the Zipf plot of X in the x-axis.
• Power transformation: let g ¼ ð1 Àk x Þ=ð1 Àk y Þ be the non-linear normalizing parameter, which is the ratio of the slopes of the two least squares lines for profiles X and Y. We can apply a power transformation Y 0 = Y γ so the two least squares lines have the same slope.
In summary, profile Y is normalized with respect to reference profile X as where σ and γ are the two normalizing parameters. We divide the expressions in the profile to be normalized by σ to rescale the expressions so the two profiles will be normalized to close levels. The power transformation with parameter γ further improves the similarity between the upper-tail behaviors of the two distributions.
It is worth pointing out that estimating the power law distribution parameters κ x and κ y using the ordinary least squares regression with pre-specified values of x min and y min might not be efficient. We can consider estimating κ x , κ y , x min and y min altogether based on the highly expressed genes in the two profiles using the Hill estimator [16], or a maximum likelihood estimator as in [17].
As shown in Fig 2, the Zipf plot of profile Y after linear normalization (the dash-dotted curve) is almost identical to the Zipf plot of profile X (solid curve). The slopes of the two fitted least squares lines are very close, which suggests that a non-linear normalization might not be necessary. , respectively. The MA plot is a commonly used graphical tool to visualize highthroughput sequencing analysis. It first transforms the data in the two profiles onto M (log ratio) and A (average log-expression) scales, then visualizes the differences of the measurements in the two profiles by plotting M against A. In plot (a) we take M = log(X) − log(Y) and A = (log(X) + log(Y))/2, where X and Y are the measures in profiles 47 and 107, respectively. In plot (b), we replace Y with the normalized value Y 0 . The solid curve is fitted using R function loess with span = 0.2, while the dashed curve is the loess curve with span = 0.1. From plot (a), we see that the majority of the points fall below the horizontal line with M = 0. This indicates that most genes have larger measures in profile 107 and the expressions need to be scaled down so the two profiles will be brought to close levels. The loess curves show straight line patterns for genes satisfying A > 5, which indicates that linear scaling is appropriate for the genes that are not weakly expressed. Meanwhile, plot (a) of Fig 3 shows that the loess curves have different non-linear patterns for points with A < 2, which indicates that linear rescaling normalization alone may not be able to achieve good normalization results for weakly expressed genes. In other words, a non-linear transformation is needed to further improve the normalization outcomes.

An example to normalize two gene profiles
We choose the highly expressed genes by selecting the expressions with ranks exp(6) = 403 or lower and fit least squares lines to these points in the Zipf plots. The slopes of the fitted least squares lines are -1.734 and -1.730 for profiles 47 and 107, respectively. Hence a rough estimate of γ isĝ ¼ ðÀ 1:730Þ=ðÀ 1:734Þ ¼ 0:9977.
. . . ; logðz y 1 Þ; logðz y 2 Þ; . . .Þ be a vector of the log-transformed gene expressions of the selected highly expressed genes from the two profiles. Also, let R ¼ ðlogðr x 1 Þ; logðr x 2 Þ; . . . ; logðr y 1 Þ; logðr y 2 Þ; . . .Þ be a vector of the corresponding log-ranks, and G = (0, 0, . . ., 1, 1, . . .) be the group identities. If a point Z i is from profile X, G i = 0, otherwise G i = 1. We then fit a linear model by regressing Z on R and G. The estimated coefficient of G is our estimate of σ.
Plot (b) of Fig 3 shows the MA-plot after normalization. Many genes located above the M = 0 line after normalization and the loess curves become more linear than the ones for the pre-normalization data.

Performance comparison
To evaluate the performance of the proposed normalization method, we normalize all LCL profiles with each of the following methods: TC, MED, UQ, TMM, RLE, and the Zipf plot based normalization (ZN). For fair comparisons, we only use the linear rescaling normalization scheme for ZN (without non-linear normalization). In Fig 1, many Zipf plots stay apart

PLOS ONE
from each other, which indicates that linear normalization is needed to pull the profiles to similar levels. However, non-linear normalization may not be needed as the Zipf plots show parallel curve patterns in their upper tails. This is also verified by the estimated non-linear normalizing parameter between profiles 47 and 107, which is very close to 1.0 indicating that the power transformation can improve the normalization results but not much.
To reduce the numerical difficulties caused by the zeros, we exclude 39,596 (75.31%) genes with zero across all profiles. A total of 12,984 genes remain in the new dataset, where profile 58 has the smallest number of zeros (20.79%). To improve the performances of MED, UQ, TMM and RLE, we choose profile 58 as the reference profile.
• TC: this method applies to all profiles without any numerical difficulty.
• MED: among all profiles to be normalized, four profiles (3.13%) have zero median. Three of them are normalized using MED after removing all genes with zero counts in both profiles. One profile has zero median after we filter out the genes with zero in both profile. Therefore, all genes with zero in either profile are removed to apply the MED.
• UQ: all profiles have positive upper quartiles in the new dataset. UQ works with no numerical difficulty.
• TMM: to avoid numerical difficulties, genes with zero measures in both profiles are filtered out. We first compute the ratio of expressions in the profile to be normalized and the reference profile for each gene. Then we calculate the M-values by taking the logarithm of the ratios. A total of 15 profiles (11.72%) have infinite means after trimming 30% M-values from both ends. Genes with zeros in either profiles are removed for these 15 profiles to apply TMM.
• RLE: similar to TMM, genes with zero measures in both profiles are removed to compute valid M-values. After that, only one profile doesn't have valid median M-value and genes with zeros in either profile are removed to apply RLE.
• ZN: this method applies to all profiles without any numerical difficulty. Table 2 shows the summaries of the estimated normalizing parameters for the six selected normalization methods. Results show that MED has overall larger estimated normalizing parameters than the other methods. As a result, only 14.1% of the profiles are upward rescaled with MED while the percentages for the other methods are at least 50%. This is mainly because the reference profile has the smallest number of zeros and the calculation of the median is closely associated with the number of zeros in a profile. In terms of the number of zeros, the selected reference profile has overall strong signals and therefore most of the other profiles are downward rescaled if we use median to compute the normalizing parameter. There are a certain number of profiles not normalized with MED, UQ and RLE. This is because these three methods use a single quantile estimate for each profile to compute the normalizing parameter. Due to the existence of large numbers of small counts in the profiles, tied quantile estimates are likely to be found between a pair of profiles which ultimately results in normalizing parameter estimates of 1.0.
To further benchmark the performances of the six normalization methods, we identify and compare a set of "invariants" based on the normalized data. The idea is as follows: if the majority of genes are not DE and if a normalization method works well, we expect to be able to identify a subset of genes that are strongly expressed with small deviations, namely, the invariants. To take both the deviation and expression level into consideration, we use the coefficient of variation (CV) to determine whether a gene is an invariant. Genes with 50% or more zeros will not be considered as invariants because most of them are weakly expressed and can be significantly impacted by measurement errors. Because the CVs are severely skewed to the right, we perform logarithm transformation to the CVs and show the distributions. In Fig 4, Table 3 compares the agreement among different normalization methods. The diagonal cells show the cutoff CVs used to identify 1000 invariants with each normalization method. TMM and RLE have much larger cutoff CVs than the other four methods. The numbers in the lower triangular matrix show the number of genes identified by both methods. For example, among the 1000 invariants identified by TC, 850 are also identified by ZN. TMM and RLE have the largest overlap of 911 common invariants. The values in the upper triangular matrix show the Spearman correlation coefficients of the CVs computed based on data normalized using two different methods. The coefficient between TMM and RLE is 0.989, and 0.987 between ZN and TC. The overall agreement among the six normalization methods is good.
We also check the similarity of the upper-tails of all profiles by comparing the slopes of the least square lines fitted on genes with log(Rank)<6 in their Zipf plots. The 95% confidence interval of the estimated slopes is (−1.745, −1.713), which confirms that all profiles have similar patterns of upper-tail behaviors and the non-linear normalization can be optional.

Discussion
The estimates of the two normalization parameters, σ and γ, can be further improved. To find the optimal estimates of (σ, γ) jointly, we search over a fine grid in the neighborhood of the OLS estimates based on the Zipf plots as in Fig 2. We define the optimal estimates as a pair of values of (σ, γ) that minimize the the following objective function: The rationale that we choose the above Kolmogorov-Smirnov statistic to measure the distance between the distribution functions of the two profiles after normalization is its simple form and the good performance of the empirical distribution function in approximating the true distribution functions for large samples. Fig 5 shows the perspective plot of the surface of D over the σ-γ plane for profiles 47 and 107. It shows that D doesn't change much as γ varies. However, D changes greatly if we fix γ and change the value of σ alone. Fig 6 shows the MA-plot after normalization using the optimal estimates of (σ, γ) = (4.6261, 0.9874). There are very small differences between the two loess curves fitted to data normalized using the OLS estimates and the optimal estimates, respectively. Due to the lack of ground truth about the gene expressions in different profiles, it is hard to tell which estimate works better.

PLOS ONE
For genes expressed at very low levels, say A < 2, a more aggressive normalization strategy to manipulate the expressions at gene-level is not necessary. The low measures tell us whether the genes are detectable in an experiment and we shall not emphasize too much on their absolute expressions. In DE gene detection analysis, the expressions of the weakly expressed genes need to be modeled differently to appropriately calibrate the impacts of the measurement errors.

Conclusion
The proposed normalization method assumes that the distributions of the gene expressions in different profiles have similar upper tail behaviors, which is a reasonable assumption in

PLOS ONE
practice. Although we derive the normalization algorithms via an example using power law distributions, the expressions of the highly expressed genes don't have to follow power law distributions. In case the Zipf plots don't show straight line patterns, we can find rough estimates of the normalization parameters and refine the normalization by searching for optimal estimates numerically. As a by-product, the Zipf plots can be used to visually check the normalization results.
The selection of normalization method depends on the quality of the data. When the measures are overall large with the majority genes being not DE, TMM and RLE are expected to have good performances. On the contrary, if most of the genes are weakly or not expressed, ZN and TC are reliable and work well. In addition, the proposed normalization method can be used together with the other normalization methods. For example, we can first normalize the RNA-Seq tag counts using the FPKM or RPKM to correct the length-bias, then apply ZN for further improvements. Though, ZN won't be applicable to the data that have already been normalized using QN because the QN-normalized gene profiles have the same distribution.  (σ, γ). The solid curve shows the loess curve based on the data normalized with the optimal estimates, while the dashed curve is the loess curve based on data normalized using the OLS estimates. A smoothing parameter span = 0.2 is used for both curves. https://doi.org/10.1371/journal.pone.0230594.g006

PLOS ONE
If a conservative normalization strategy is preferred, we can choose to perform the linear rescaling normalization without non-linear normalization using ZN. Our method works pretty well in finding reliable normalizing parameters, especially for NGS data with excessive zeros and/or large numbers of small counts.
All the algorithms have been implemented in R and collected in function Zipf.Normalize in package bda.