Assessing Numerical Dependence in Gene Expression Summaries with the Jackknife Expression Difference

Statistical methods to test for differential expression traditionally assume that each gene's expression summaries are independent across arrays. When certain preprocessing methods are used to obtain those summaries, this assumption is not necessarily true. In general, the erroneous assumption of dependence results in a loss of statistical power. We introduce a diagnostic measure of numerical dependence for gene expression summaries from any preprocessing method and discuss the relative performance of several common preprocessing methods with respect to this measure. Some common preprocessing methods introduce non-trivial levels of numerical dependence. The issue of (between-array) dependence has received little if any attention in the literature, and researchers working with gene expression data should not take such properties for granted, or they risk unnecessarily losing statistical power.


Background
The expression values of thousands of genes can be monitored simultaneously using microarray technology [1,2]. Applications of this technology abound in the literature. This paper assumes that the reader is somewhat familiar with this technology, particularly the GeneChip microarray from Affymetrix (www.affymetrix.com), which is the most commonly used platform for gene expression studies. Some common terminology is defined herein only for the sake of clarity.
Preprocessing refers to the steps taken to convert the raw probelevel intensities to a collection of estimates of each gene's expression values on each array [3,4]. With the Affymetrix platform, preprocessing typically includes background correction (to remove local noise and other small artifacts), normalization (to make inter-array comparisons meaningful), and summarization (to combine probe-level data to a gene-level summary). A variety of preprocessing methods have been proposed for Affymetrix data, with MAS5 [5,6], Li-Wong (also referred to as dChip, or MBEI for model-based expression index) [7][8][9], RMA [3,10], GCRMA [11], PLIER [12,13], and PUMA [14][15][16] among the most commonly used. Each of these methods has a convenient implementation among the Bioconductor tools [17] for the R computing environment [18]. The result of each of these methods can be thought of as a matrix of gene expression estimates (or gene expression summaries), with a row for each gene and a column for each array in an experiment. Rather than fully summarizing each of these preprocessing methods here, we refer interested readers to the references.
After preprocessing, a wide variety of analysis options are available. When the arrays can be classified by some categorical variable, such as disease state (healthy vs. beginning disease vs. advanced disease, for example) or treatment state (control vs. treatment, for example), a test of differential expression can be considered. A test of significance is conducted to identify individual genes (or groups of genes) that exhibit systematic shifts in expression values between levels of the categorical variable.
While there are perhaps less than a dozen major preprocessing methods in the literature (plus their variants), the number of proposed methods for evaluating differential expression continues to grow. We do not attempt to catalog every possible test here, nor do we claim to have a best test. Instead, we focus our attention on a common assumption in these tests, that a gene's expression summaries from multiple arrays are independent. This is different from the issue of dependence among genes, which has been addressed previously by others [19,20]. The linear models framework in the limma approach [21] assumes the independence of a gene's expression levels, with any dependence ''assumed to be such that it can be ignored to a first order approximation.'' Other t-statistic-based approaches such as SAM [22] also implicitly assume this independence.
Depending on the preprocessing method, the expression summaries for a given gene may not be truly independent across arrays. For example, RMA essentially shares information across arrays at both the (quantile) normalization and (median polish) summarization steps, so the RMA expression summaries on one array will depend to some degree on the original intensities on other arrays. On the other hand, MAS5 preprocesses each array individually, sharing no information across arrays at any step of preprocessing.
A general principle of statistical inference is that if model assumptions are violated, no claim of statistical significance can be made. A common goal of statistical applications to gene expression data is to perform statistical inference by identifying significantly differentially expressed genes. We seek to draw attention to the fact that any such statistical inference is suspect when the assumption of independence is violated. Our motivation in this paper is primarily to shed light on the numerical properties of several common gene expression summaries, as they relate to this assumption of independence, rather than to account for dependence in a particular test for differential expression.

Illustrative Scenario
To illustrate the impact of erroneously assuming independence, we present a small illustrative scenario. We emphasize that this small scenario is merely used to illustrate the principle that ignoring dependence matters in statistical inference, and hope that this scenario does not detract from the main focus of this paper, which is given in the Methods section.
Consider a two-sample z-test, where for replicate i of treatment j (i~1, . . . ,n; j~1,2), where the indicator function I ½j~2~1 when j~2, and equals 0 otherwise. Here, the vector E~(E 1,1 , . . . ,E 2,n ) is multivariate normal with mean 0 and compound symmetric covariance matrix S: That is, S is 1 on the diagonal and r for all off-diagonal elements, with 0vrv1 defining the degree of dependence. This scenario can be represented in matrix form: where X is the 2n|2 design matrix with all 19s in the first column, and with n 09s followed by n 19s in the second column, and b * is the vector of ''intercept'' and ''slope'' (or ''treatment effect'') parameters and E*N(0,S) so that Y * *N(X b * ,S), where the vector Y *~( Y 1,1 , . . . ,Y 2,n ).
Using ordinary least squares (i.e., ignoring the dependence r) and linear models theory [23], Here, the (o) in superscript is for ordinary least squares. It can be shown (using a symbolic computation package such as Maple) that the n (1{r). Based on this ordinary least squares approach (which assumes r~0 If b Ã 1 is the true value of b 1 and z a=2 is the upper a=2 critical value of the standard normal distribution, then the statistical power for the test of H 0 : b 1~0 (while ignoring dependence) is where Z Ã is a truly N(0,1) random variable. Specifically, Using weighted least squares (i.e., accounting for the dependence r) and linear models theory [23], Here, the (w) in superscript is for weighted least squares. It can be shown (using a symbolic computation package such as Maple) that the variance of n (1{r). Based on this weighted least squares The statistical power for the test of H 0 : b 1~0 (while accounting for dependence) is ð5Þ where Z ÃÃ is a truly N(0,1) random variable. Specifically, We can compare the statistical power when dependence is ignored (P (o) in Equation 7) with the statistical power when dependence is accounted for (P (w) in Equation 10) by focusing on the left and right endpoints of their respective final probability formulae. If 0vrv1, then 1= ffiffiffiffiffiffiffiffiffiffi and It follows then that P (o) vP (w) . The contour plots in Figure 1 summarize this difference in power for a range of r and b 1 values. Clearly, greater magnitude of ''treatment effect'' b 1 leads to greater statistical power at any given level of dependence r. However, ignoring dependence leads to a loss of statistical power, with greater losses for greater dependence (higher r) and more subtle magnitudes of ''treatment effect'' (smaller b 1 ). Although the tests for differential expression with real gene expression data may be different than a simple twosample z-test, this general principle remains -that erroneously assuming independence leads to a loss of statistical power. This motivates our attention to the numerical dependence introduced by various common preprocessing methods.

Jackknife Expression Difference (JED)
Here we propose a simple method to assess the (between-array) numerical dependence of gene expression summaries. For a particular gene, letm m x andm m y be the gene's non-negative log-scale expression level summaries for arrays x and y, respectively, after some preprocessing method. (For certain preprocessing methods, including PLIER and PUMA, it is possible to find negative expression summaries. We treat such cases as having very little evidence of expression, and reset negative expression summaries on an array to the smallest positive expression summary observed for all genes on the array.) For the same gene, letm m x(y) be the expression level estimate for array x when array y is not included in any step of the preprocessing, with conventionm m x(x) :0 to represent no information for array x when excluded. Then we define the Jackknife Expression Difference (JED) between arrays x and y for the gene to be Notice that by definition, 0ƒJED(x,y)ƒ1, and JED(x,x)~1, indicating strict numerical dependence of an array (or its summaries) with itself. Also, this JED measure is standardized such that JED(x,y)~0 whenm m x andm m y are strictly numerically independent, i.e., whenm m x(y)~m m x andm m y(x)~m m y . JED(x,y) can be interpreted as the average percent change in the expression value of the gene because of the inclusion or exclusion of arrays x and y in the preprocessing.
JED values of 0 indicate total independence between pairs of arrays, while values of 1 indicate total dependence between pairs of arrays. If JED(x,y)~0:25 for a particular gene and arrays x and y, then the expression value of the gene on those two arrays would change by an average 25% if either array had not been included in the study. Incidentally, defines a distance function for the gene's expression summaries between arrays x and y. The jackknife approach can be considered the simplest of resampling techniques [24], and while it can exclude more than one at a time, the most common application of the jackknife principle is ''leave-one-out'' [25]. In the multi-array gene expression situation, this allows for pairwise (between array) distance comparisons by dropping (one at a time) members of pairs of arrays (x and y). Other resampling approaches such as the general jackknife (leaving out more than one) or the bootstrap (drawing at random with replacement) do not lend themselves so easily to this pairwise interpretation. The R code to obtain this JED measure is provided (with an example) in Text S1.

Covariance and JED
The JED measure assesses numerical dependence in gene expression summaries. While similar in spirit, this numerical dependence is not the same as what we refer to as statistical dependence, which could be represented by a true correlation or covariance matrix for each gene. Ifm m * is the vector of expression summaries (for all arrays) for a given gene under a particular preprocessing method, then the covariance matrix would be Constructing such a per-gene covariance matrix for a given preprocessing method would require a well-defined distribution for the method's gene expression summaries (m m * ). In practice, such well-defined distributions are rare (and unheard of) for preprocessing methods, and it is usually not possible to estimate this matrix V . For some preprocessing methods, however, the diagonal elements of V (the variances of the expression summaries) can be estimated, either in closed form based on the distribution ofm m * (as for Li-Wong and for PUMA), or as an approximation using the bootstrap (as for RMA [26]).
To investigate the general relationship between a gene's Jackknife Expression Difference JED(x,y) and the covariance V xy for a pair of arrays x and y, we define a preprocessing method we will refer to as MINDEP (for minimum dependency). We emphasize that we do not recommend using this preprocessing method in general; we only use it here because its resulting covariance matrix V can be obtained using standard statistical theory. In this MINDEP approach, no background correction and no normalization is done, and a two-factor linear ANOVA-type model is assumed for each gene at the summarization step: Here, Y xj is the log-scale perfect match intensity for probe j on array x, A x is the mean array effect, and P j is the mean probe effect. Letĥ h be the vector of resulting ordinary least squares parameter estimates of the A x 's and P j 's, with covariance matrix S S~Cov(ĥ h). This covariance matrix can be obtained because of the well-known properties of least squares estimates [23]. The LSMEAN (or marginal mean or population mean) for the gene on array x is defined as where J is the number of probes for the gene. Then the MINDEP expression summary for the gene on array x is defined aŝ Here, w is a weight parameter ranging from 0 to 1, and a x is a vector of appropriate coefficients (specific to array x).
For the sake of completeness, we briefly show the construction of a x from Equation 18. Let X be the number of arrays and J be the number of probes for a given gene. Corresponding to array x, a x is a length X zJ vector with th element a x, . For 1ƒxvX , a x,~0 for vX and =x, a x,x~1 , and a x,X~{ 1. For x~X , a x,~0 for ƒX . For 1ƒxƒX and 1ƒjƒJ, a x,X zj~1 =J. For example, if there were X~3 arrays and J~5 probes, then the three vectors a 1 , a 2 , a 3 would be the rows of the matrix The subtraction of the minimum array mean in Equation 18 is intended to serve as a pseudo-background-correction, and larger w introduces greater dependence between the resulting expression summaries, with known covariance between arrays x and y: Thus for each gene, we can obtain a vector of MINDEP expression estimates m and its corresponding covariance matrix V . The weight parameter w can be varied to show the simultaneous effect of greater dependence on covariance and JED.
We note that the weight parameter w in Equation 18 could be set to give negative covariance values between arrays x and y. However, by definition (and via the built-in symmetry), JED is non-negative. This helps preserve its interpretation.

Results
For illustration purposes, we applied this JED measure for six common preprocessing methods to four datasets. The publiclyavailable Affymetrix HGU95A spike-in data [27] consist of 59 arrays and 12,626 probesets on each array. For our demonstration, only 8 arrays were used, corresponding to groups M-T of wafer 1532 of the spike-in data. We also applied JED to the publicly-available Platinum Spike [28] data set (18,952 probesets on each of 18 arrays) and the publicly-available Golden Spike [29] data set (14,010 probesets on each of 6 arrays) as well as a previously published Asbestos [30] data set (54,675 probesets on each of 6 arrays). Because the results from these four different data sets were so similar, we do not fully report the results from each. Unless otherwise specified, the results given here are for the HGU95A spike-in data. for all x=y). The popular RMA method introduces some numerical dependence, but the dependence is certainly not as substantial as that observed in other methods.

Visual Summaries of JED
We considered whether the JED measure preserves some biological or chemical aspect of the genes. If it did, we would expect to see similarities in JED measures from different preprocessing methods, especially similar preprocessing methods. Figure 3 compares the JED measures for RMA and GCRMA, which share the same quantile approach at the normalization step and the same median polish approach [31] at the summarization step of preprocessing. (Figures 3 and 4 make use of hexagonal binning [32] in the scatter plots, with darker colors indicating greater density of points.) Based on Figure 3, there is no evidence that the JED measures from these two preprocessing methods are related, even though they share two preprocessing steps. Similar non-relation results (not shown) are observed for the other pairs of preprocessing methods that do not share preprocessing steps. This suggests that the JED measure reports numerical artifacts of the preprocessing method, and is not biological or chemical in origin.
We also considered if the magnitude of the JED measure might be related to the corresponding magnitude of expression. Figure 4A compares the JED measure for RMA with the pairwise mean RMA expression summary. That is, for each gene, and for each pair of arrays x and y, JED(x,y) is plotted against (m m x zm m y )=2. For purposes of visualization, points corresponding to the same array pair (x~y, where JED~1) are omitted. The largest JED values correspond to lower-expressed genes, but relatively large JED values can be observed for higher-expressed genes. Similar results are observed for other preprocessing methods, including PLIER as in Figure 4B. We note with some concern that some large PLIER expression values (around 10) have moderately large JED values (around 0.35), such that some of the most highly expressed genes (after PLIER preprocessing) are subject to about 35% average change in expression based on the inclusion or exclusion of some arrays. While the results of Figure 4 are for this sample HGU95A data set, the trends seen here raise concern about the levels of numerical dependence introduced by some preprocessing methods, even for more highly-expressed genes. Figure 4B shows some banding near PLIER JED values of 0.5 and 1, which are an artifact of sign changes induced by the jackknife. For example, for a given gene and arrays x and y, it could be thatm m x w0 butm m x(y) v0, so that the jackknife (exclusion of array y) induces a sign change for the gene's expression summary on array x. Similarly, exclusion of array x could induce a sign change for the gene's expression summary on array y. In both cases, the sign change could go from positive to negative or from negative to positive. For each gene and each pair of arrays (x, y), the number of sign changes induced by the jackknife will be 0, 1, or 2. Figure 5 summarizes the PLIER JED values by number of observed sign changes, with clear banding at 0.5 for genes (and array pairs) with one sign change, and at 1 for those with two sign changes. From the Methods section above, recall that we treat a negative expression summary as having very little evidence of expression, and reset such negative expression summaries on an array to the smallest positive expression summary observed for all genes on the array. Let E x w0 be the smallest positive expression summary observed for all genes on array x, and E x(y) w0 be the smallest positive expression summary observed for all genes on array x when array y is excluded from the preprocessing. Then if a gene exhibits a sign change on array x upon exclusion of array y (for example,m m x w0, butm m x(y) v0 is reset to E x(y) w0), the first portion of the JED calculation in Equation 13 is  Dm

Numerical Artifact Due to Sign Changes
This explains the pattern near JED~0:5 observed for genes (and array pairs) with one sign change in Figure 5. If the gene (and array pair) has two sign changes induced by the jackknife, then both portions of the JED calculation in Equation 13 will be approximately 1 2 (as in Equation 21), explaining the pattern near JED~1 for genes (and array pairs) with two sign changes in Figure 5. It is important to point out that even if one focuses only on genes with positive expression summaries (zero sign changes in Figure 5), very high JED values can be seen for PLIER. Similar results (not shown here) can be seen for PUMA, the other preprocessing method considered here with possibly negative expression summaries.

JED and Correlation
Using the previously defined MINDEP preprocessing method, we considered the general relationship between JED and correlation (rescaled covariance) between expression summaries. The trellis plot in Figure 6 summarizes the result. (Like Figures 3  and 4, Figure 6 also makes use of hexagonal binning [32] in the scatter plots, with darker colors indicating greater density of points.) At any given weight parameter value (w in Equation 18), there is no direct relationship between JED and correlation, so JED cannot be used as a direct proxy for correlation. However, looking across a range of weight parameter values, a general relationship can be observed, allowing general statements about the dependence level induced by a given preprocessing method. In this context, we think of MINDEP for different weight values w as being different preprocessing methods. Using weight w~0 in MINDEP, there is no dependence introduced, and both JED and correlation (between expression estimates for a gene on array pairs) are 0. As the weight parameter increases towards 1, the correlations overall increase, with the main distribution of correlation values centering around 0.7 for the higher weights. At the same time, as the weight parameter increases towards 1, the JED values' range increases, with larger JED values becoming more common. In other words, the proliferation of larger JED values is indicative of higher underlying correlations being possible. Such a general relationship can only be shown explicitly for a preprocessing method like MINDEP, where correlation (scaled covariance) can be calculated.

Discussion
Throughout this paper, we have used the term ''numerical dependence'' as a convenient descriptive term to distinguish from ''statistical dependence.'' In reality the JED measure is also related to the notion of robustness (of the gene expression estimate on one array to the inclusion/exclusion of another array for/from consideration). In general, it is not always clear how to statistically define robustness [24], and in the specific case of the JED measure, there is no direct translation to correlation. We investigated several approaches to incorporate our JED measure into an estimate of the covariance matrix V (Equation 15) for this purpose, but finally concluded that while numerical dependence can be assessed via the JED measure, it can not be used to define statistical dependence in a general way. For that reason, we do not present any method to account for numerical dependence in a test for differential expression. We do note, however, that some available tests for differential expression use probe-level rather than fully preprocessed data, so the dependence issue is less of a concern for those methods, which are particularly well-suited to small-sample studies [33].
In presenting the JED here, we are very careful to state that we only propose to use the JED measure as a diagnostic comparison of preprocessing methods, and not for inference; in fact we emphasize that it can not be contorted to fit the purpose of inference. The results of Figure 6 indicate that while JED values cannot be used as a direct proxy for correlation values for any given array pair for any given gene, the JED can be used as a diagnostic to assess the relative amounts of dependence induced by various preprocessing methods.
The JED measure does not estimate a particular parameter -it only provides a summary of the amount of numerical information shared between arrays in calculating gene expression estimates. Because it does not pertain to a defined parameter (but rather to the notion of robustness), the JED measure does not lend itself to hypothesis testing or thresholds of statistical significance. For this reason, we do not propose cut-offs for ''acceptable'' JED values. Such thresholds (perhaps for ''failure'' of a preprocessing method) would of necessity be subjective because the relationship between JED and statistical correlation will depend on the preprocessing method (and not, in general, be known for the most common preprocessing methods). Instead, we propose and present here an objective evaluation of several preprocessing methods by demonstrating their JED performance on multiple real data sets (in the Results section). A wider range (and larger extremes) in JED values is indicative of greater induced numerical dependence. A gene's JED value for a pair of arrays is interpreted as the average percent change in the gene's expression value based on the inclusion or exclusion of each array, and as such, is an interesting diagnostic in its own right (even without incorporation to a test of differential expression). For example, there are some moderately large JED values (around 0.35) in Figure 4B for genes with expression values around 10; the interpretation of these values is that those genes (after PLIER preprocessing) are subject to about 35% average change in expression based on the inclusion or exclusion of some arrays. Ideally, there would be no numerical dependence induced by preprocessing (JED~0). However, it is not only the existence of extreme values that concern us, but the abundance of large JED values (or the skewing of the JED distribution towards 1) in some preprocessing methods (Figures 2 and 6) that we note with alarm.
The JED measure presented here can be used to comparatively assess the numerical independence of gene expression summaries from any given preprocessing method. In fact, this is the primary strength of the JED measure. Better-known measures of dependence or correlation require knowledge of distributions or wellknown statistical properties of estimates, which is not the case for most common preprocessing methods. For example, considerm m x andm m y as estimated expression values for a given gene on arrays x and y, respectively. Just to calculate the simple covariance However, for most commonly-used preprocessing methods, the probability distribution of estimated expression values is not known (or even assumed!), and their statistical properties are not well-known. The JED measure provides a way to quickly summarize some notion of dependence between arrays for any preprocessing method, with no need to know its distributional properties.
We emphasize that the JED measure is not a diagnostic of arrays or samples or genes, but of preprocessing methods. We do not propose (and in fact actively discourage) the use of JED for other purposes such as, for example, to identify significantly correlated arrays. While it could be shown for some preprocessing methods that lower JED values roughly correspond to higher correlations between arrays, we discourage this approach (and do not show the results of a simulation we considered to address this very point) for two reasons. First, if an analysis objective is to identify significantly correlated arrays, it is conceptually and computationally far more simple to look at scatterplots of log-scale PM (between pairs of arrays) or something similar than to use the JED. Second, the JED measure has no basis for inference; it is simply a descriptive statistic that, viewed across many genes in several microarray studies (as we have done here), provides insight to the relative levels of numerical dependence induced by various preprocessing methods. This is its sole intended purpose. The JED's performance (in assessing relative amounts of numerical dependence from various preprocessing methods) can only be assessed by repeated application to several data sets, as we have done here. Any JED-based inference would, of necessity, require knowledge of the statistical properties of the JED measure. As discussed in the ''JED and Correlation'' section above as well as the preceding paragraph, such knowledge is unavailable for the commonly-used preprocessing methods, but fortunately such knowledge is also unnecessary for using the JED in its intended purpose.
Even though a preprocessing method may demonstrate stricter independence in the JED sense (such as MAS5 in Figure 2), it is not necessarily the ''best'' preprocessing method. Other measures such as bias and performance on spike-in datasets [34,35] are important to consider in the selection of a preprocessing method. We do not recommend any particular method here, but note in passing that the popular RMA method demonstrates only modest numerical dependence in comparison to some other methods currently used in the literature (Figure 2).
Newer technologies such as RNA-Seq are of course becoming more common for gene expression experiments, and statistical methods are being developed for their appropriate analysis [36,37]. However, microarrays remain a vital research tool in many fields where an organism's transcriptome is fully defined, and funds are limited. Furthermore, the vast archives of publiclyavailable microarray data (most notably, NCBI's GEO [38]) serve as a rich resource for targeted hypothesis generation and validation in modern studies, and their use is active and ongoing [39]. The appropriate analysis of microarray data (including appropriate application of independence assumptions) will continue to lead to new biological insights.
Motivated by a desire to avoid lost statistical power (as demonstrated by Figure 1 and the Introduction section above) in tests for differential expression, we encourage the use of preprocessing methods with lower numerical dependence. The JED measure here can assess some notion of dependence for any preprocessing method, even when the distributional properties of the method's expression values are unknown. By doing so, we wish to draw attention to the underlying assumption of (between-array) independence in gene expression summaries for tests of differential expression. This issue of (between-array) independence has received little if any attention in the literature, and researchers working with gene expression data should not take these properties for granted, or they risk unnecessarily losing statistical power.

Supporting Information
Text S1 A .txt file providing the R code to obtain this JED measure (with an example). (TXT)