MicroRNA Array Normalization: An Evaluation Using a Randomized Dataset as the Benchmark

MicroRNA arrays possess a number of unique data features that challenge the assumption key to many normalization methods. We assessed the performance of existing normalization methods using two microRNA array datasets derived from the same set of tumor samples: one dataset was generated using a blocked randomization design when assigning arrays to samples and hence was free of confounding array effects; the second dataset was generated without blocking or randomization and exhibited array effects. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization and compared against the benchmark. Normalization improved the true positive rate significantly in the non-randomized data but still possessed a false discovery rate as high as 50%. Adding a batch adjustment step before normalization further reduced the number of false positive markers while maintaining a similar number of true positive markers, which resulted in a false discovery rate of 32% to 48%, depending on the specific normalization method. We concluded the paper with some insights on possible causes of false discoveries to shed light on how to improve normalization for microRNA arrays.


Introduction
Normalization is an essential step of microarray data preprocessing [1]. It aims to remove non-biologic effects resulting from the experimental process so that biologic effects can be accurately identified [2,3]. Methods for array normalization have been developed in the context of mRNA expression arrays. Main-stay approaches, such as median normalization and quantile normalization, are based on the data of all genes on the array, which we call ''all-gene'' methods [4,5]. They rely on the assumption that only a very small proportion of the molecular markers on the array are differentially expressed, or that the numbers of over-and under-expressed markers are similar.
MicroRNAs (miRNAs) are a prevalent class of small singlestranded non-coding RNAs that negatively regulate gene expression by inducing mRNA degradation and translational repression, and are involved in a wide variety of cellular functions such as proliferation, differentiation, and apoptosis [6][7][8]. MiRNA arrays possess a number of unique data characteristics comparing with mRNA arrays. First, miRNA arrays contain markers for a much smaller number of miRNAs -the Agilent miRNA arrays (release 16.0) have markers for about 1,300 miRNAs, while mRNA arrays typically have markers for tens of thousands of genes. Second, differential expression is more likely to be common and asymmetric among miRNAs. The majority of miRNAs are expected to express in a very tissue-specific manner [9][10][11][12]. They were found to be important in tumorigenesis and show widespread differential expression between malignant and normal cells caused by mechanisms such as chromosomal alterations, DNA point mutations, and epigenetic changes [11,13]. As a result, miRNA array data challenge the assumption of the all-gene methods (that the proportion of differentially expressed markers is small or the amount of over-and under-expression is similar). Therefore, the performance of normalization methods needs to be re-assessed for miRNA arrays using genuine benchmark datasets that realistically represent miRNA array data characteristics.
In this paper, we report the results from an empirical evaluation of normalization methods using a pair of miRNA array datasets generated at Memorial Sloan Kettering Cancer Center [14]. This study examined miRNA expression in a set of 96 high-grade serous ovarian cancer samples and 96 endometrioid endometrial cancer samples, which were all newly diagnosed, previously untreated, and collected at Memorial Sloan Kettering Cancer Center between 2000 and 2012, using Human miRNA microarrays (Agilent Technologies). Two datasets were generated for the same set of tumor samples using different experimental designs: (1) in one dataset, arrays were assigned to samples using a blocked randomization design and handled by an experienced technician in one single run; (2) in another dataset, arrays were processed in the order that specimen were collected and handled by two technicians in multiple runs. The randomized dataset contained no confounding array effects and required no normalization, while the non-randomized dataset possessed array effects and needed normalization. The randomized dataset was assessed for differential expression between two tumor groups and treated as the benchmark. The non-randomized dataset was assessed for differential expression after normalization (with or without a separate batch adjustment step) and compared against the benchmark. We concluded the paper by providing insights on possible causes of false discoveries and potential directions to further improve normalization for microRNA arrays.

Methods
The use of human tumor tissues in our study was approved by the Memorial Sloan Kettering Cancer Center Institutional Review Board.
Justification for the randomized data as the benchmark Analysis of variance (ANOVA) has been successfully used to model the relation between mRNA gene expression and sample group, which attributes gene expression variation to factors such as marker effect, sample effect, and stochastic noise [15][16][17][18]. Here we use ANOVA to model miRNA expression and thus obtain insights on how randomization removes confounding array effects.
Let y Ã gi denote the true underlying expression level for marker g and sample i, andx i denote sample group (an indicator variable taking values 0 and 1 for a two-group study such as a case-control study). We can model y Ã gi as the sum of marker effect b g , sample effect x i c g , and stochastic noise e gi .
y Ã gi~bg zx i c g ze gi Let y gijp denote the observed expression level measured by microarray j for marker g, sample i, and replicate p. Agilent release 16.0 arrays contain 3,523 markers (that is, probes) representing 1,347 miRNAs and 10-40 replicates for each marker. Since j = 1 for all samples in the randomized data, we will use y gip for simplification. y gip can be modeled as the sum of true expression level y Ã gi , array effect a gi , and measurement error e gip .
y gip~y Ã gi za gi ze gip b g zx i c g za gi z(e gi ze gip ) In the setting of marker-specific comparisons (that is, one between-group comparison per marker), we assume that (1) b g is a fixed effect representing the baseline expression for marker g, (2) c g is a fixed effect representing the difference of expression between sample groups for marker g and is the parameter of interest, (3) a gi is a random effect whose mean depends on non-biologic factors such as array production batch, hybridization run, technician, and scanner, (4) e gi and e gip are random effects each normally distributed with mean 0, and (5) all the random effects are independent. Our model uses a most general form for array effects and allows it to be marker-and sample-specific. When a sample is profiled on only one array, array effect a gi is not identifiable from sample effect e gi . The goal of normalization hence is to introduce reasonable assumptions and effectively model array effects across markers so as to make array effects identifiable, estimable, and subsequently removable.
Different from normalization, randomization requires no modeling of array effects and it removes their confounding effect by balancing them between two sample groups. The mean of the observed expression y gip among control samples and case samples are denoted as E 0 (y gip ) and E 1 (y gip ), respectively, and they can be expressed as: Hence the difference in the observed average expression between cases and controls is That is, the difference in observed means is biased by Da g . The variance of the observed expression y gip can be expressed as var(y gip )~var(a gi )zvar(e gi )zvar(e gip ) Thus the presence of array effects influences both the accuracy and precision of the estimate of c g , and consequently the accuracy of the hypothesis test for detecting markers that are truly differentially expressed. The estimated c g , the parameter of interest, will achieve the best accuracy and precision when Da g~0 and var(a gi )~0.
The randomized dataset in our study included 192 arrays. Agilent miRNA arrays have an 8-plex block design, with eight arrays on each slide arranged as two rows and four columns. When assigning arrays to sample groups we used the blocked randomization design with row and column balance in order to ensure the best balance. There are six possible configurations where the numbers of cases and controls are equal on any row and any column of a slide [14]. We assigned the 24 slides for the 192 samples to one of the six configurations with equal probabilities and dedicated the arrays to one of the two sample groups. We then assigned each group of arrays to a random permutation of the samples in the corresponding group. As the result of randomization, Da g is close to zero for all markers in the randomized data.
We carefully planned our study for generating the randomized dataset. All 192 arrays were ordered from the same manufacture batch. Their hybridization and production were processed in one run by a single experienced technician at the Memorial Sloan Kettering Cancer Center Genomics Core Lab. As a result of the uniform array handling, var(a gi ) is minimal in the randomized data.

Statistical analysis of the randomized data
Data preprocessing. We analyzed the data both with and without background subtraction, and found similar results in terms of the relative performance of normalization methods. We report the results for the analyses without background subtraction here. There was minimum variation among replicates for the same probe [14]. We summarized data from replicates for the same probe using the median on the log2 scale. Differential expression analysis. We assessed evidence against the null hypothesis of equivalent expression, using the t statistic comparing two sample groups in the randomized data [19]. A separate test was performed for each of the 3,523 markers on the Agilent array and a two-sided p-value was calculated. The p-values were then used to derive a marker set at a given significance level: markers whose p-values were smaller than the significance level were declared differentially expressed, and those having larger p-values were declared non-differentially expressed.

Statistical analysis of the non-randomized data
Data preprocessing. We preprocessed the non-randomized data with array normalization followed by probe summarization on the log2 scale. We applied commonly-used methods for normalization and used the same approach to probe summarization as for the randomized dataset. The normalization methods we used were median normalization, quantile normalization, cyclic loess normalization, and variance stabilizing normalization (Table  S1). Briefly, median normalization shifts the data on an array by a constant so that arrays share the same median [20]; quantile normalization calculates a reference distribution as the averaged order statistics across arrays and then reset the order statistics on each array to this reference distribution [4]; cyclic loess normalization iterates through array pairs in a pre-specified order, and for each pair, it plots the difference of the two arrays versus their average intensity, fits a loess curve, and uses it as the new horizontal axis [4]; variance stabilizing normalization transforms the data (before log2 transformation) using a family of parametric transformations so that the variance of the resulted data is independent of the mean [21]. All four methods are based on the data of all markers on the array. In addition to normalization, we also tested whether adding a batch adjustment step before normalization can further improve the accuracy of biomarker discovery. We used two batch adjustment methods that have been proposed to adjust for gross differences between arrays: (1) standardization and (2) ComBat [22,23].
Differential expression analysis. The preprocessed data were analyzed for differential expression using the same approach as that for the randomized dataset. The resulting p-values and the differential expression status based on the non-randomized data were compared with the differential expression status derived from the randomized data using ROC curves and cross tabulation, respectively [24].

Empirical evaluation of array normalization
For the purpose of evaluating the effect of normalization on the accuracy of biomarker discovery, we called the randomized data as the benchmark data and the non-randomized data as the test data. Figure 1 shows the effect of normalization on the overall distribution of the test data. Table 1 shows the relative accuracy of biomarker detection in the normalized test data comparing with the benchmark data.
Among the 3,523 markers on the Agilent array, 351 markers (10%) were identified to be differentially expressed in the benchmark data, indicating a moderately abundant level of asymmetric differential expression. Without normalization, 1934 markers (55%) were identified to be differentially expressed at a pvalue cutoff of 0.01 in the test data, which was associated with a true positive rate (TPR) of 185/351 (53%), a false positive rate (FPR) of 1749/3172 (55%), and a false discovery rate (FDR) of 90%. Almost all of the false positive markers had very low expression levels and some of the false negative markers had medium to high expression levels in the benchmark data ( Figure  S1).
With normalization, the number of differentially expressed markers was reduced to 639 (TPR: 85%, FPR: 11%, FDR: 54%)  Figure  S2). Normalization improved the detection of differentially expressed markers with both an increased number of true positive markers and a decreased number of false positive markers in the empirical analysis of our data. However, even with the application of normalization, the number of false positive markers was still as many as the number of true positive markers, corresponding to a false discovery rate of about 50%, regardless of the specific normalization method used.

Empirical evaluation of array normalization following batch adjustment
We next examined whether an addition of a batch adjustment step before array normalization can further improve the accuracy of differential expression detection. We compared the accuracy of calling differentially expressed markers with versus without batch adjustment, in combination with each normalization method tested in our study. It showed that normalization alone called highly similar markers positive to normalization combined with standardization, and moderately similar to normalization combined with ComBat ( Figure S3).When comparing with the benchmark data, adding standardization to normalization made virtually no change to the number of false and true positive markers ( Figure S4); adding ComBat further reduced the number of false positive markers while maintaining a similar number of true positive markers, which led to a FDR of 32% to 48% depending on the specific normalization method (Table 2 and Figure S4). Taken together, these results support the use of ComBat in combination with normalization to improve the accuracy of biomarker discovery in the analysis of miRNA array data. The choice of normalization method, however, depends on a trade-off of true positive rate and false positive rate. When combined with ComBat, quantile normalization (TPR: 93%, FPR: 9%, FDR: 48%) and cyclic loess normalization (TPR: 91%, FPR: 7%, FDR: 42%) had a high TPR but a relatively high FPR, while median normalization (TPR: 84%, FPR: 4%, FDR: 32%) and no normalization (TPR: 66%, FPR: 1%, FDR: 16%) had a relatively low TPR but a low FPR.  Causes of false positive and false negative markers There are two possible mechanisms through which false negative (or false positive) markers can occur as a result of array effects and attempted removal of array effects by normalization: one is by introducing bias to the data, and another is by increasing (or decreasing) the variability of the data. In order to examine these two possibilities we looked at scatter plots of the following summary statistics: (1) mean expression differences between two tumor groups for the benchmark data versus that for the test data, in order to look at the level of biases in the test data; and (2) marker-specific standard deviations for the benchmark data versus that for the test data, in order to look at the change of data variability. Figure 2A and Figure 3A show that array effects led to both a (dominantly negative) bias in mean differences (that is, ovarian mean minus endometrial mean) and an overall increase in variability in the test data. More specifically, the bias primarily shifted the data towards endometrial tumors: it pulled markers whose true mean differences are around zero away from zero, and some markers whose true mean differences are moderately positive close to zero. Most false positive markers had mean differences close to zero in the benchmark and were resulted from the negative biases in mean difference caused by array effects. Most false negative markers had positive mean differences in the benchmark and were resulted from the under-estimated magnitudes of mean difference. The level of increase in data variability is similar for the majority of markers. This increase partly contributed to the occurrence of false negative markers.
We generated similar scatter plots for the test data after median normalization ( Figure 2B and Figure 3B). Median normalization corrected the bias in mean difference caused by array effects but with some over-correction. It also decreased the level of data variability but also with a level of over-correction especially for markers whose variability was small in the benchmark. False positive markers after median normalization are primarily caused by over-estimated mean differences and under-estimated standard deviations. Most false positive markers were up-regulated in the benchmark. False negative markers were primarily resulted from under-estimated absolute mean differences, predominantly among markers that are down-regulated in the benchmark.
Similar plots showed that quantile normalization effectively corrected the bias caused by array effects; however, it overly compressed the variability of the data ( Figure 2C and Figure 3C). The false positive markers after quantile normalization are primarily caused by under-estimated standard deviations, while the remaining false negative markers are primarily due to underestimated magnitude of mean differences.

Conclusions
We utilized a pair of miRNA array datasets on the same set of tumor samples to perform an objective and absolute assessment of normalization performance under genuine data characteristics. Comparing with previous reports on the assessment of normalization methods for miRNA arrays, our approach provides an important alternative evaluation that challenges the critical assumption of the all-gene methods and offers insights on their performance when the assumptions are violated [25][26][27].
In the presence of array effects, normalization and batch adjustment can improve the accuracy of detecting differentially expressed markers. However, the number of false positive markers can still be close to the number of true positive markers as demonstrated in our study.
There is a critical need to develop efficient methods to normalize miRNA array data with both effective bias correction and proper variability reduction so that miRNAs having disease relevance can be identified in an accurate manner. Figure S1 Scatter plot comparing group means in the randomized dataset. True positive markers, false negative markers, and false positive markers, as determined in the nonrandomized data (without normalization) in comparison with the randomized dataset as the benchmark, are indicated as ''x'' in black, blue, and red, respectively. True negative markers are indicated as black dots. (DOCX) Figure S2 ROC curves comparing the two-sample tstatistic p-values for the test dataset with (A) no normalization, (B) median normalization, (C) quantile normalization, (D) cyclic loess normalization, and (E) variance stabilizing normalization, with the gold standard (that is, the differential expression status determined by the benchmark dataset). (DOCX) Figure S3 Comparison of differential expression analysis of the test data before versus after batch adjustment. Each Venn diagram compares differentially expressed markers identified in the test data after normalization following no BEC (yellow circle) versus those with BEC (using either standardization (green circle) or ComBat (blue circle)). (DOCX) Figure S4 Results of differential expression analysis of the test data before and after batch adjustment, in comparison with the gold standard derived from the benchmark data. Each Venn diagram compares differentially expressed markers identified in the test data after normalization following no BEC (yellow circle) or BEC (using standardization (green circle) or ComBat (blue circle)) versus those identified in the benchmark data (red circle). (DOCX)