
JTL and JDS conceived and designed the experiments, performed the experiments, contributed reagents/materials/analysis tools, and wrote the paper. JTL analyzed the data.

The authors have declared that no competing interests exist.

It has unambiguously been shown that genetic, environmental, demographic, and technical factors may have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce sources of spurious signal to many genes. This phenomenon is true even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies. We show that SVA increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.

Large-scale gene expression studies allow one to characterize transcriptional variation with respect to measured variables of interest, such as differing environments, treatments, time points, phenotypes, or clinical outcomes. However, a number of unmeasured or unmodeled factors may also influence the expression of any particular gene. Besides inducing widespread dependence in measurements across genes [

We call “primary measured variables” (or primary variables) those variables that are explicitly modeled in the analysis of an expression study. These variables may or may not be associated with any given gene's expression variation. We classify all the remaining sources of expression variation into three basic types. “Unmodeled factors” are sources of variation that are explained by measured variables but are not explicitly included in the statistical model (e.g., because their relationship to expression is intractable or the relevant measured variables were excluded because of sample size restrictions). “Unmeasured factors” are sources of expression variation that are not measured in the course of the study; because they cannot be included in the model either, we also refer to them as unmodeled factors. Finally, “gene-specific noise” refers to random fluctuations in gene expression independently realized from gene to gene.

As a simple example meant only for illustrative purposes, consider a human expression study where disease state in a particular tissue type is the primary variable. Suppose that in addition to changes in expression being associated with disease state, the age of the individuals also has a substantial influence on expression. Thus, some genes exhibit differential expression with respect to disease state, some with respect to age, and some with respect to both. If age is not included in the model when identifying differential expression with respect to disease state, we show that this may (a) induce extra variability in the expression levels due to the effect of age, decreasing our power to detect associations with disease state, (b) introduce spurious signal because the effect of age on expression may be confounded with disease state, or (c) induce long-range dependence in the apparent “noise” of the expression data, complicating any assessment of statistical significance for differential expression. In practice, even if age were known, it may be one of dozens of available measured factors, making it statistically intractable to determine which to include in the model. Furthermore, even measured factors such as age may act on distinct sets of genes in different ways, or may interact with an unobserved factor, making the effect of age on expression difficult to model. “Expression heterogeneity” (EH) is used here to describe patterns of variation due to any unmodeled factor.
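Scenario (b) can be made concrete with a small numerical sketch. The numbers below are invented purely for illustration: a hypothetical gene responds to age only, and the diseased group happens to be older. The helper names (`mean`, `slope`, `residuals`) are assumptions of this sketch and are not part of any SVA software.

```python
# Toy illustration of confounding: expression depends on age only,
# but age is correlated with disease state. All numbers are made up.

def mean(v):
    return sum(v) / len(v)

def slope(x, y):
    """Least-squares slope from regressing y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def residuals(x, y):
    """Residuals of y after regressing it on x (with an intercept)."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

disease = [0, 0, 0, 0, 1, 1, 1, 1]
age = [30, 35, 40, 45, 50, 55, 60, 65]   # diseased group is older
expr = [0.1 * a for a in age]            # true disease effect is zero

# Unadjusted analysis: regress expression on disease state alone.
unadjusted = slope(disease, expr)

# Adjusted analysis: remove the age effect from both variables, then
# regress residual on residual (equivalent to including age in the model).
adjusted = slope(residuals(age, disease), residuals(age, expr))

print(unadjusted, adjusted)
```

In this toy setting the unadjusted disease coefficient equals the difference in group means (2.0 expression units) even though disease has no effect, while the age-adjusted coefficient is zero up to floating-point error.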

Major sources of expression variation are due to technical [

In each of these studies, expression variation with respect to one or at most a handful of variables is explored. However, it is likely that in each study multiple sources of EH will act on distinct, but possibly overlapping, sets of genes. Normalization techniques are commonly used to detect and adjust for systematic expression variation due to well-characterized laboratory and technical sources [

Here, we introduce “surrogate variable analysis” (SVA) to identify, estimate, and utilize the components of EH.

One thousand gene expression datasets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated Examples.

(A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies. Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA (Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal).

(B) For each simulated dataset, a Kolmogorov-Smirnov test was employed to assess whether the

(C) A plot of expected true positives versus FDR for the SVA-adjusted (solid) and unadjusted (dashed) analyses. The SVA-adjusted analysis shows increased power to detect true differential expression.

We apply SVA to three distinct expression studies [

We have developed an approach called surrogate variable analysis that appropriately borrows information across genes to estimate the large-scale effects of all unmodeled factors directly from the expression data.

(A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays.

(B) Genes 1–300 in this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are shown as an indicator variable for each array.

(C) Genes 201–500 in each simulated study are affected by an independent factor that causes EH. This factor is distinct from, but possibly correlated with, the group variable. Here, the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear function of the covariates.

Each of the four steps is necessary: Step 1 ensures that the surrogate variables estimate EH and not the signal from the primary variable; Step 2 ensures an accurate estimate of each surrogate variable by identifying the specific subset of genes driving each EH signature; Step 3 allows for correlation between the primary variable and the surrogate variables by building the surrogate variables on the original expression data; and Step 4 takes into account the fact that a surrogate variable may have a different effect on each gene. The third and fourth steps are particularly important for maintaining unbiased significance with SVA, as demonstrated below.

The overall goal of SVA is to provide a more accurate and reproducible parsing of signal and noise in the analysis of an expression study when EH is present. One way in which signal is commonly quantified is through a significance analysis [

We performed a simulation study to investigate the properties of SVA with respect to large-scale significance testing. Specifically, we show that the SVA algorithm (a) accurately estimates signatures of expression heterogeneity, (b) corrects the null distribution of

We first assessed the accuracy of the surrogate variables estimated from SVA. In 99.5% of the simulated studies, a permutation procedure [^{2} value.) The average correlation between the estimated surrogate variable and the true unmodeled factor over all 1,000 experiments was 0.95 with a standard deviation of 0.05. Each surrogate variable is a weighted average of the expression measurements over a subset of genes. We chose a conservative adaptive cutoff for determining the number of genes affected by each orthogonal EH signal to avoid overfitting. The SVA algorithm also identifies genes affected by the unmodeled factor. On average, 30.5% of the truly affected genes were identified as affected, whereas only 9.9% of the truly unaffected genes were identified as affected.

It is well known that in a significance analysis,

These noteworthy fluctuations and biases in the null

It should also be noted that

Perhaps most importantly, SVA also results in a more powerful and reproducible ranking of genes for differential expression. This can be seen in

There has been much recent interest in the effect of expression dependence across genes on estimates of multiple testing significance measures. Large-scale dependence has been shown to be particularly problematic for estimating FDR, as dependence across genes increases the variance of most standard FDR estimators [

To assess the accuracy of the SVA algorithm in the case where the primary variable and unmodeled factors are heavily correlated, we performed a second simulation study. The set-up for the second simulation study was identical to that for the original study above, except in this case the unmodeled factor was simulated such that the average correlation with the primary variable was 0.50 with a standard deviation of 0.16. Under this model, the unobserved factor is both correlated with the primary variable and affects an overlapping set of genes. This is representative of the potential confounding present in observational microarray studies (see Disease Class below) and that which happens by chance in a non-negligible subset of randomized studies. Even in this set-up, the permutation hypothesis test correctly identified a single surrogate variable in 94.5% of the simulated datasets. Further, the average correlation between the estimated surrogate variable and the true unmodeled factor over 1,000 datasets was 0.94 with a standard deviation of 0.22. Thus, SVA accurately estimates the unobserved factor even when there is strong dependence between the primary and unobserved factors, with a subset of genes affected by both. SVA also provided a correct Uniform distribution of null

Several recent studies have carried out the genetic dissection of expression variation at the genome-wide level [

As proof of concept, the Brem et al. [

A number of expression traits have significant

(A) A plot of significant linkage peaks (

(B) Significant linkage peaks (

Pivotal

Significance Results

We applied the SVA approach to two human studies [

Hedenfalk et al. [

Hierarchical clustering [

(A) A plot of the top surrogate variable estimated from the breast cancer data [

(B) A plot of tissue type versus array for the Rodwell et al. [

As shown above, SVA also increases the accuracy and stability of the ordering of the significant gene lists (see Simulated Examples). Since it is standard practice to examine only the most significant genes for further study, an SVA-adjusted analysis may result in completely distinct biological conclusions. For example,

Rodwell et al. [

We then applied SVA to the expression data ignoring the tissue information. The top surrogate variable identified by SVA had a correlation of 0.86 with tissue type (

At standard

There are several well-established statistical approaches for partitioning of sources of variation among multiple variables into components [

When performing a significance analysis of an expression study with respect to primary variables, one cannot employ this classical approach. In association studies, population structure has genome-wide effects with a signal much stronger than that of the primary variable; in expression studies, by contrast, the signal structure tends to be much more complex. There can be multiple levels of signal from multiple sources that each affect certain subsets of genes, making it important to supervise the decomposition with respect to known primary variables and these subsets of genes.

To demonstrate these issues, we considered two straightforward significance analysis applications of the well-established singular value decomposition approach previously utilized in genomics [

SVA is a new methodological development aimed at overcoming the issues not addressed by existing methods. Rather than decomposing the entire expression matrix (or genotype matrix), SVA performs what could be called a “supervised factor analysis” of the expression data (

It is clear that EH induces widespread dependence in expression variation across genes. EH is therefore related to the issue of multiple testing dependence, which has been recognized as an important problem [

A histogram of the null

If the original data are ignored and an adjustment for EH is applied to the

Expression heterogeneity due to technical, genetic, environmental, or demographic variables is common in gene expression studies. Here we have introduced a new method, SVA, for identifying, estimating, and incorporating sources of EH in an expression analysis. SVA uses the expression data themselves to identify groups of genes affected by each unobserved factor and estimates the factor based on the expression of those genes. Simulations show that SVA accurately detects expression heterogeneity and improves significance analyses. We performed SVA on experiments involving recombinant inbred lines, individuals of varying disease state, and expression measured over time to illustrate the broad range of studies to which SVA can be applied. One advantage of the SVA approach is the ability to disentangle correlated and overlapping differential expression signals. This approach may be particularly useful in clinical studies, where a large number of clinical variables may have a complicated joint impact on expression. We have implemented SVA in an open source package available for downloading at

Three publicly available datasets were employed to represent a broad range of gene expression studies performed in practice. The first dataset consists of gene expression measurements for 6,216 genes in 112 segregants of a cross between two isogenic strains of yeast, as well as genotypes across 3,312 markers [

The SVA algorithm identified 14 significant surrogate variables from the expression data. We performed both an unadjusted and an SVA-adjusted linkage analysis for each expression trait. In the unadjusted analysis, linkage

For each study, we simulated expression for 1,000 genes on 20 arrays divided between the two disease states. For simplicity, the expression measurements for each gene were drawn from a normal distribution with mean zero and variance one. We simulated expression heterogeneity with a dichotomous unmodeled factor independent of the disease state. The mean differences between disease states and states of the unmodeled factor were drawn from two independent normal distributions. For the real data example, we calculated the residuals from the regression of
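A minimal sketch of this kind of simulation is given below, using only the Python standard library. Several details are assumptions of the sketch rather than the authors' settings: the affected gene sets (genes 1–300 for the primary variable, genes 201–500 for the unmodeled factor) are taken from the simulated-study figure caption, the effect sizes are drawn with unit standard deviation, and the function name `simulate_study` is hypothetical.

```python
import random

def simulate_study(m=1000, n=20, seed=0):
    """Simulate an m-gene by n-array study with a two-group primary variable
    and one dichotomous unmodeled factor (illustrative settings only)."""
    rng = random.Random(seed)
    # Primary variable: arrays split evenly between two disease states.
    group = [0] * (n // 2) + [1] * (n - n // 2)
    # Dichotomous unmodeled factor, drawn independently of the group labels.
    factor = [rng.randint(0, 1) for _ in range(n)]
    # Gene-specific mean differences from two independent normal distributions;
    # genes outside each affected set get a coefficient of zero.
    beta = [rng.gauss(0, 1) if i < 300 else 0.0 for i in range(m)]
    gamma = [rng.gauss(0, 1) if 200 <= i < 500 else 0.0 for i in range(m)]
    # Expression = primary effect + unmodeled-factor effect + N(0, 1) noise.
    x = [
        [beta[i] * group[j] + gamma[i] * factor[j] + rng.gauss(0, 1)
         for j in range(n)]
        for i in range(m)
    ]
    return x, group, factor, beta, gamma

x, group, factor, beta, gamma = simulate_study(seed=1)
```

Fixing the seed makes each simulated study reproducible, which is convenient when comparing unadjusted and adjusted analyses on the same data.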

Differential expression was calculated using a

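Let X_{m×n} = (x_{1},..,x_{m})^{T} be the normalized m × n expression matrix with n arrays for m genes, where x_{i} = (x_{i1},..,x_{in})^{T} is the vector of normalized expression for gene i. Let y = (y_{1},..,y_{n})^{T} be a vector of length n representing the primary variable of interest.

Without loss of generality, model x_{ij} = μ_{i} + f_{i}(y_{j}) + e_{ij}, where μ_{i} is the baseline level of expression, f_{i}(y_{j}) = E(x_{ij} | y_{j}) − μ_{i} gives the relationship between the primary variable and the expression of gene i, and e_{ij} is random noise with mean zero. For example, if y_{j} ∈ {0, 1} is an indicator of disease state, then we use the linear model x_{ij} = μ_{i} + β_{i} y_{j} + e_{ij}, where β_{i} is the difference in mean expression between the two disease states for gene i.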
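For the two-group case, the least-squares fit of the linear model x_{ij} = μ_{i} + β_{i} y_{j} + e_{ij} reduces to group means. The sketch below fits one gene and returns its residuals; the function name `fit_two_group` and the example values are hypothetical.

```python
def fit_two_group(x_i, y):
    """Least-squares fit of x_ij = mu_i + beta_i * y_j + e_ij for y_j in {0, 1}.

    Returns (mu_hat, beta_hat, residuals) for a single gene.
    """
    group0 = [x for x, label in zip(x_i, y) if label == 0]
    group1 = [x for x, label in zip(x_i, y) if label == 1]
    mu_hat = sum(group0) / len(group0)              # baseline: mean of group 0
    beta_hat = sum(group1) / len(group1) - mu_hat   # difference in group means
    resid = [x - mu_hat - beta_hat * label for x, label in zip(x_i, y)]
    return mu_hat, beta_hat, resid

# One made-up gene measured on six arrays, three per disease state.
mu_hat, beta_hat, resid = fit_two_group([1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
                                        [0, 0, 0, 1, 1, 1])
```

Stacking the residuals of all genes row by row gives a matrix from which the primary-variable signal has been removed.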

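Suppose in a microarray study there are L unmodeled factors. Let g_{ℓ} = (g_{ℓ1},...,g_{ℓn}) be an arbitrarily complicated function of the ℓth factor across all n arrays, for ℓ = 1, 2, ..., L. Our model becomes x_{ij} = μ_{i} + f_{i}(y_{j}) + Σ_{ℓ=1}^{L} γ_{ℓi} g_{ℓj} + e^{*}_{ij}, where γ_{ℓi} is a gene-specific coefficient for the ℓth unmodeled factor. If unmodeled factor ℓ does not influence the expression of gene i, then γ_{ℓi} = 0. The fact that we employ an additive model is actually quite general: it has been shown that even complicated nonlinear functions of factors can be represented in an additive fashion for a reasonable choice of a nonlinear basis [g_{ℓ} to be as nonlinear as necessary and make

Due to this formulation, the inter-gene dependent e_{ij} is replaced by Σ_{ℓ=1}^{L} γ_{ℓi} g_{ℓj} + e^{*}_{ij}, where e^{*}_{ij} is the gene-specific noise, which is independent across genes.

It is not possible in general to directly estimate the unmodeled g_{ℓ}, and SVA does not attempt to do so. The key observation is to note that for mutually orthogonal vectors h_{k} (k = 1, ..., K, with K ≤ L) spanning the same linear space as the g_{ℓ}, we have Σ_{ℓ=1}^{L} γ_{ℓi} g_{ℓj} = Σ_{k=1}^{K} λ_{ki} h_{kj}. In other words, for any set of vectors g_{ℓ} and coefficients γ_{ℓi}, it is possible to identify mutually orthogonal vectors h_{k} and coefficients λ_{ki} such that this equality holds for every gene i and array j.

Therefore, we do not need to estimate the specific variables g_{ℓ}. We only need to estimate the linear combination Σ_{k=1}^{K} λ_{ki} h_{kj}, and we call the vectors h_{1}, h_{2}, ..., h_{K} surrogate variables.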
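The orthogonal re-expression of the unmodeled factors can be checked numerically. The sketch below uses Gram-Schmidt orthogonalization as one simple way to obtain mutually orthogonal vectors; this choice, the toy factors, and the coefficients are assumptions for illustration only (SVA itself relies on the singular value decomposition). The third toy factor is a linear combination of the first two, so three factors reduce to two orthogonal vectors.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(vectors, tol=1e-12):
    """Return mutually orthogonal vectors spanning the same space as `vectors`."""
    basis = []
    for g in vectors:
        h = list(g)
        for b in basis:
            c = dot(h, b) / dot(b, b)
            h = [hi - c * bi for hi, bi in zip(h, b)]
        if dot(h, h) > tol:      # drop vectors already in the span
            basis.append(h)
    return basis

# Three toy unmodeled factors on n = 4 arrays; the third is g1 + g2.
g = [[1, 0, 1, 0], [1, 1, 1, 1], [2, 1, 2, 1]]
gamma = [2.0, -1.0, 0.5]         # made-up coefficients for one gene

# EH signal for this gene written in terms of the original factors.
signal = [sum(gamma[l] * g[l][j] for l in range(3)) for j in range(4)]

# The same signal written in terms of K <= L orthogonal vectors.
h = gram_schmidt(g)
lam = [dot(signal, hk) / dot(hk, hk) for hk in h]
rebuilt = [sum(lam[k] * h[k][j] for k in range(len(h))) for j in range(4)]
```

Here `rebuilt` matches `signal` entry by entry, which is why only the orthogonal vectors and their gene-specific coefficients need to be estimated.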

An intuitive question that arises from an inspection of this formulation is about the model assumptions of the g_{ℓj}. Whereas the term f_{i}(y_{j}) is modeled as a function of the measured primary variable y_{j}, we do not need to specify g_{ℓj} as a function of a well-defined, measured variable. Since we estimate the outcomes of the surrogate variables directly from the expression data, there is no need to express g_{ℓj} in terms of a biologically meaningful variable. Thus, we can bypass the need to know what the most relevant model of a measured variable is for g_{ℓj} for the purposes of estimating the EH.

The goal of the SVA algorithm is therefore to identify and estimate the surrogate variables h_{k} = (h_{k1},...,h_{kn})^{T}, k = 1, ..., K.

The algorithm is decomposed into two parts: detection of unmodeled factors and construction of surrogate variables. The basic form of the first algorithm has been proposed previously [

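1. Form estimates μ̂_{i} and f̂_{i} by fitting the model x_{ij} = μ_{i} + f_{i}(y_{j}) + e_{ij}, and calculate the residuals r_{ij} = x_{ij} − μ̂_{i} − f̂_{i}(y_{j}) to remove the effect of the primary variable on expression. Form the m × n residual matrix R, where the (i, j) element of R is r_{ij}.

2. Calculate the singular value decomposition of the residual expression matrix R = UDV^{T}.

3. Let d_{ℓ} be the ℓth eigenvalue, which is the ℓth diagonal element of D, for ℓ = 1, ..., n. If df is the degrees of freedom of the model fit μ̂_{i} + f̂_{i}(y_{j}), then by construction the last df eigenvalues are exactly zero and we remove them from consideration. For eigengene k = 1, ..., n − df, set the observed statistic to be T_{k} = d_{k}^{2} / Σ_{ℓ=1}^{n−df} d_{ℓ}^{2}, which is the proportion of variance explained by the kth eigengene.

4. Form a matrix R^{*} by permuting each row of R independently to remove any structure in the matrix. Denote the (i, j) entry of R^{*} by r^{*}_{ij}.

5. Fit the model of step 1 to each row of R^{*} and calculate the residuals to form the m × n model-subtracted null matrix R_{0}.

6. Calculate the singular value decomposition of the centered and permuted expression matrix R_{0} = U_{0}D_{0}V_{0}^{T}.

7. For eigengene k, form a null statistic T_{k}^{0} = d_{0k}^{2} / Σ_{ℓ=1}^{n−df} d_{0ℓ}^{2} as above, where d_{0ℓ} is the ℓth diagonal element of D_{0}.

8. Repeat steps 4−7 a total of B times to obtain null statistics T_{k}^{0b} for b = 1, ..., B and k = 1, ..., n − df.

9. Compute the p-value for eigengene k as p_{k} = #{T_{k}^{0b} ≥ T_{k}; b = 1, ..., B} / B.

Since eigengene k should be significant whenever eigengene k′ is (where k′ > k), we conservatively force monotonicity among the p-values. Thus, set p_{k} = max(p_{k−1}, p_{k}) for k = 2, ..., n − df.

10. For a user-chosen significance level 0≤α≤1, call eigengene k a significant signature of residual EH if p_{k} ≤ α.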
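The final steps of the detection algorithm can be sketched in a few lines, given singular values computed elsewhere. The sketch assumes that the statistic for each eigengene is its proportion of variance explained and that the p-value is the fraction of permutation statistics at least as large as the observed one; the function names and toy numbers are hypothetical.

```python
def variance_explained(singular_values):
    """Proportion of variance explained by each eigengene (d_k^2 / sum of d^2)."""
    total = sum(d * d for d in singular_values)
    return [d * d / total for d in singular_values]

def eigengene_p_values(observed, null_stats):
    """Permutation p-values with forced monotonicity.

    observed   : list of observed statistics, one per eigengene
    null_stats : list of B lists, each holding one permutation's statistics
    """
    B = len(null_stats)
    p = [sum(1 for null in null_stats if null[k] >= t_obs) / B
         for k, t_obs in enumerate(observed)]
    # A later eigengene cannot be more significant than an earlier one.
    for k in range(1, len(p)):
        p[k] = max(p[k - 1], p[k])
    return p

def significant_eigengenes(p, alpha=0.05):
    """Indices (0-based) of eigengenes called significant at level alpha."""
    return [k for k, pk in enumerate(p) if pk <= alpha]

# Toy statistics for three eigengenes and B = 4 permutations.
observed = [0.6, 0.5, 0.1]
null_stats = [[0.40, 0.35, 0.25],
              [0.50, 0.30, 0.20],
              [0.45, 0.25, 0.30],
              [0.70, 0.20, 0.10]]
p = eigengene_p_values(observed, null_stats)
```

In this toy input the raw p-value of the second eigengene (0.0) is smaller than that of the first (0.25), so the monotonicity step raises it to 0.25. With only four permutations the attainable p-values are coarse; a real analysis would use a much larger B.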

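1. Form estimates μ̂_{i} and f̂_{i} by fitting the model x_{ij} = μ_{i} + f_{i}(y_{j}) + e_{ij}, and calculate the residuals r_{ij} = x_{ij} − μ̂_{i} − f̂_{i}(y_{j}) to remove the effect of the primary variable on expression. Form the m × n residual matrix R, where the (i, j) element of R is r_{ij}.

2. Calculate the singular value decomposition of the residual expression matrix R = UDV^{T}. Let e_{k} = (e_{k1},...,e_{kn})^{T} be the kth column of V, for k = 1, ..., n. These e_{k} are the residual eigengenes and represent orthogonal residual EH signals independent of the signal due to the primary variable.

3. Set K̂ to be the number of significant eigengenes found by the algorithm to detect unmodeled factors.

For each significant eigengene e_{k}, k = 1, ..., K̂:

4. Regress e_{k} on the x_{i} (i = 1, ..., m) and calculate a p-value testing for an association between the residual eigengene e_{k} and the expression of each gene.

5. Let π_{0} be the proportion of genes with expression not truly associated with e_{k}. Form an estimate π̂_{0}, and estimate the number of genes associated with the residual eigengene by m̂_{1} = ⌊(1 − π̂_{0}) × m⌋. Let s_{1}, ..., s_{m̂_{1}} be the indices of the genes with the m̂_{1} smallest p-values from this test.

6. Form the m̂_{1} × n reduced expression matrix X_{r} = (x_{s_{1}}, ..., x_{s_{m̂_{1}}})^{T} for eigengene e_{k}. Calculate the eigengenes of X_{r} and denote them by e^{r}_{j}, for j = 1, ..., n.

7. Let j^{*} = argmax_{1≤j≤n} cor(e_{k}, e^{r}_{j}) and set ĥ_{k} = e^{r}_{j^{*}}. In other words, set the estimate of the surrogate variable to be the eigengene of the reduced matrix most correlated with the corresponding residual eigengene.

8. In any subsequent analysis, employ the model x_{ij} = μ_{i} + f_{i}(y_{j}) + Σ_{k=1}^{K̂} λ_{ki} ĥ_{kj} + e^{*}_{ij}, which serves as an estimate of the ideal model x_{ij} = μ_{i} + f_{i}(y_{j}) + Σ_{k=1}^{K} λ_{ki} h_{kj} + e^{*}_{ij}.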
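Two of the bookkeeping steps in the construction algorithm, choosing the subset of genes associated with a residual eigengene and matching it to an eigengene of the reduced matrix, can be sketched as follows. The sketch assumes the per-gene p-values, the estimate of π_{0}, and the candidate eigengenes have already been computed by other means. It compares absolute correlations because the sign of an eigengene is arbitrary; that choice, like the function names, is an assumption of this sketch.

```python
import math

def select_associated_genes(p_values, pi0_hat):
    """Indices of the floor((1 - pi0_hat) * m) genes with the smallest p-values."""
    m = len(p_values)
    m1 = math.floor((1 - pi0_hat) * m)
    order = sorted(range(m), key=lambda i: p_values[i])
    return order[:m1]

def correlation(u, v):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def pick_surrogate(residual_eigengene, reduced_eigengenes):
    """Index of the reduced-matrix eigengene most correlated (in absolute
    value) with the residual eigengene."""
    return max(range(len(reduced_eigengenes)),
               key=lambda j: abs(correlation(residual_eigengene,
                                             reduced_eigengenes[j])))

# Toy inputs: six genes with half estimated to be null, and three candidates.
genes = select_associated_genes([0.9, 0.01, 0.5, 0.03, 0.2, 0.7], pi0_hat=0.5)
best = pick_surrogate([1, 2, 3, 4],
                      [[1, -1, 1, -1], [4, 3, 2, 1], [1, 2, 4, 3]])
```

The selected eigengene then enters the model as an additional covariate with its own coefficient for each gene.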

The singular value decomposition is employed in these SVA algorithms. It may be possible to utilize other decomposition methods, but since the singular value decomposition provides uncorrelated variables that decompose the data in an additive linear fashion with the goal of minimizing the sum of squares, we found this to be the most appropriate decomposition. If the primary variables are modeled for data that are not continuous, then it may make sense to decompose the variation with respect to whatever model-fitting criteria will be employed

SVA has been made freely available as an R package at

Heatmaps of hierarchically clustered gene expression data for a random subset of 1,000 genes from three studies are shown. (A) Hedenfalk et al. [


Histograms of the null


Histograms of the null


For each simulated dataset based on the permuted residuals from the Hedenfalk et al. study, a nested Kolmogorov-Smirnov test was employed to assess whether the


A plot of the true rank (according to signal-to-noise ratio) versus the significance test–based average rank (black) plus or minus one standard deviation (red) for each differentially expressed gene in simulated studies (A) affected by EH with an unadjusted analysis, (B) affected by EH with an SVA-adjusted analysis, and (C) unaffected by EH.


(A) A histogram of the estimates of the proportion of true nulls _{0} for studies affected by EH. (B) A histogram of the estimates of the proportion of true nulls _{0} for studies affected by EH, after adjusting for SVA. (C) A histogram of the estimates of the proportion of true nulls _{0} for studies without EH. (D) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies affected by EH. (E) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies affected by EH, adjusted by SVA. (F) A plot of observed FDR versus true FDR (grey) and average observed FDR versus true FDR (red) for simulated studies without EH.


(A) A plot of the top surrogate variable from the breast cancer data of Hedenfalk et al. [


A plot of the


Histograms of the null


Histograms of the null


For 1,000 simulated datasets based on the Normal residuals, a nested Kolmogorov-Smirnov test was employed to assess whether the


We thank the investigators of the Inflammation and the Host Response to Injury Consortium (

EH, expression heterogeneity

FDR, false discovery rate

QTL, quantitative trait locus

SVA, surrogate variable analysis