Screening for interaction effects in gene expression data

Expression quantitative trait (eQTL) studies are a powerful tool for identifying genetic variants that affect levels of messenger RNA. Since gene expression is controlled by a complex network of gene-regulating factors, one way to identify these factors is to search for interaction effects between genetic variants and mRNA levels of transcription factors (TFs) and their respective target genes. However, identification of interaction effects in gene expression data pose a variety of methodological challenges, and it has become clear that such analyses should be conducted and interpreted with caution. Investigating the validity and interpretability of several interaction tests when screening for eQTL SNPs whose effect on the target gene expression is modified by the expression level of a transcription factor, we characterized two important methodological issues. First, we stress the scale-dependency of interaction effects and highlight that commonly applied transformation of gene expression data can induce or remove interactions, making interpretation of results more challenging. We then demonstrate that, in the setting of moderate to strong interaction effects on the order of what may be reasonably expected for eQTL studies, standard interaction screening can be biased due to heteroscedasticity induced by true interactions. Using simulation and real data analysis, we outline a set of reasonable minimum conditions and sample size requirements for reliable detection of variant-by-environment and variant-by-TF interactions using the heteroscedasticity consistent covariance-based approach.


Introduction
Gene-gene and gene-environment interaction effects on common human traits and diseases have been difficult to identify [1]. Part of the challenge is the small effect size of genetic variants on macro-phenotypes (e.g. disease status or anthropometric traits). Assuming that interactions PLOS  have effect sizes of the same magnitude as marginal genetic effects, the sample size needed to detect them can be up to an order of magnitude larger [2]. In order to circumvent this issue, researchers have performed screening for interaction effects on intermediate phenotypes (e.g., gene expression, proteomic, metabolomic) that presumably are directly affected by genetic variation in a causal pathway from variant to disease phenotype [3][4][5][6]. Indeed, reported marginal effects of single nucleotide polymorphisms (SNP) on gene expression are often substantially higher than those reported in genome-wide association studies (GWAS) of common traits and diseases. It is reasonable to assume that the interaction effects will also be larger and therefore easier to detect. In this study we analyzed blood gene expression and genotype data from 121 subjects in the ECLIPSE Study [7,8] to test for interaction effects between cis-eQTL SNPs (i.e. SNPs within 250kb of any autosomal gene) and the expression levels of transcription factors (TFs), since one of the known mechanisms for expression quantitative trait loci (eQTLs) is disruption of TF-binding motifs [9]. However, after careful evaluation of empirical performance of standard methods, we found that Type I error rates can be severely inflated. In particular, we show through simulations that genome-wide interaction screening in the setting of moderate to large main and interaction effects poses two major challenges. The first challenge relates to data pre-processing. Heavy pre-processing is commonly applied to gene expression data to account for variability across samples, libraries, or experimental conditions [10][11][12]. Choices made at this stage can impact the results of interaction screening, and while some approaches likely address specific technical artifacts more effectively, no pre-processing strategy is known to be universally best [13,14]. Pre-processing often also includes variable normalization to obtain approximately-Gaussian data, which can help the small-sample performance of testing approaches (see e.g. [3,4]). However, interaction effects are scale-dependent [15][16][17] and nonlinear transformation of the data can have a major impact on the interpretation of interaction tests. The second challenge relates to statistical issues that arise in the presence of moderate to strong interaction effects. We and others showed in previous work that interactions can influence the distribution of a quantitative trait conditional on the interacting predictors [18,19]. For small interaction effects, as expected for most human traits and diseases, the impact on the outcome distribution is expected to be minimal. However, moderate to strong interaction effects can induce substantial heterogeneity of variance by genotypic class, which can in turn lead to inconsistent covariance matrix estimation. Non-constant variance can induce uncontrolled Type I error rates and decreased power. This implies that the presence of a strong interaction between two predictors (e.g., a SNP and a TF) can potentially invalidate screening for interaction effect between the interacting SNP and other risk factors.
Using simulation we investigate these issues by quantifying the performance of five analytical strategies to detect interaction: standard linear regression, two heteroscedasticity-consistent covariance estimates, dichotomizing the predictors, and a saturated model. More specifically, we assessed the robustness of these approaches when heteroscedasticity has been induced through interaction effects while varying sample size, minor allele frequency and the magnitude of the main and interaction effects simulated. We identify minimal conditions necessary for valid tests of interaction in eQTL studies. Finally, for illustration purposes, we also present the results from the TF by SNP interaction screening in ECLIPSE. This real data analysis confirms the findings of our simulations, highlighting that standard approaches might have severely inflated type I error rate. Moreover, we observed that the list of significant associations changed dramatically across approaches, especially when comparing analyses of transformed versus untransformed gene expression data, stressing that the two sets of analyses capture different patterns in the data. ranks are unlikely to induce false signals for marginal effects, and they are a valid alternative to non-linear regression when the distribution of the error is not normal [22]. Indeed, previous studies used ranks instead of raw data to identify upregulated genes [23,24]. Another advantage of rkt is that it addresses outlying values while preserving sample size. Expression data often has noise-induced outliers (i.e., a few individuals with high expression values) because of experimental artifacts or stochastic properties of the biological system, which can lead to an increased rate of type I and type II errors [25]. The naive correction based on removing outliers can substantially reduce sample size. Rank-based normalization allows for those subjects to be retained.
However, in contrast to screening for marginal effects, rkt might have a stronger impact on Type I and Type II error rates when testing for interaction [21,26]. More generally, non-linear transformations such as rkt can dramatically impact the interpretation of interaction effects. In particular, while differences in significance before and after transformations are expected to be relatively minimal when interaction effects are small (e.g. r 2 <1%, as in genome-wide association studies of common traits and diseases), this is not necessarily the case when the variance explained by predictors is large. Consider for example the generative model defined in Eq (1) (see Material and methods) where the expression of gene Y depends on a SNP G, an exposure E interacting with G (mRNA levels of a transcription factor in our case), and the residual ε follows a skewed-normal distribution with mean 0 and variance 1 (in order to reflect deviation from normality as observed for some gene expression data). Note that for simplicity, we assume the covariate Z has no effect (γ z = 0) on Y. In the absence of interaction effect (γ GE = 0), if E and G explain a relatively large amount of the variance of Y (e.g., r 2 !30%, see Fig A in S1 File, doing a rank-based inverse normal transformation of Y (or other type of non-linear transformation) can induce a statistical interaction between G and E. For illustration, we plotted in Fig 2 two representative scenarios of transformation inducing or removing interaction. One of the examples is where the generative model does not include interaction, while the rank-normal transform data does; and the second example illustrates the opposite case where the generative model does include an interaction between G and E, but the rank-normal transformed data does not. Note that we considered a non-normal residual as it can be an argument for applying a rank-based inverse normal transformation, but similar issues might arise when the residuals are normally distributed. Figs B-C in S1 File further illustrate how rank-based transformation can induce interaction when none is present on the original scale. While the scaledependency of interaction effects has been demonstrated previously [15][16][17], this issue is infrequently addressed in gene expression interaction analyses. Commonly applied non-linear data transformations should be accounted for in the analysis and interpretation of interaction effects in gene expression data.

The paradox of interaction effects
The second major challenge we found is that while weak interaction effects between two risk factors are unlikely to have a noticeable impact on the conditional distribution of the outcome, this is not the case for moderate to strong interactions. For an interacting SNP, this will be expressed through heterogeneity of variance of the outcome residual by genotypic class, a property that has been proposed as a target for interaction screening (using heterogeneity of the outcome's variance as a proxy for the residual's variance) [6,18,19]. Heteroscedasticity can cause the usual standard error estimates of ordinary least square coefficients to be inconsistent, thereby invalidating tests for interaction between that SNP and other risk factors (excluding the true interacting factor which would be under the alternative) [27]. To illustrate this point we conducted a simulation study using the model previously described (Eq (1), see We defined an outcome Y as a function of a single nucleotide polymorphism G with a minor allele frequency of 0.1, an exposure E normally distributed with mean 5 and variance 1, and a rightskewed normal distributed residual term ε. In the framework of this analysis, TF mRNA level is considered as an exposure E. We generated two datasets of 10,000 individuals for the two scenarios. In a) G and E have only main effects and each explain 20% of the variance of Y. In b) G and E main effects each explain 10% of the outcome variance, but also have an interaction effect explaining 20% of the variance of Y. Upper panels show Y as a function of E by genotypic class and trend slope from a standard linear regression. Lower panels show the same data plotted after a rank-normal transformation (rkt) of Y. Interaction effect (observed as differences in slope by genotypic class) appears or disappears depending on the transformation applied to Y. P-values for interaction are indicated in red. https://doi.org/10.1371/journal.pone.0173847.g002

eQTL interaction
Material and methods), but assuming the true interacting factor, E, unknown, and drawing ε, the residual from a normal distribution with variance scaled so that the variance of Y is one for all scenarios simulated. For simplicity we also assumed that E and Z follow normal distributions with mean 0 and variance 1. We tested for interaction between G and the non-interacting factor Z in a series of replicates simulated in the presence or absence of main effects for G, E and Z, as defined in Eq (2) (see Material and methods).
Fig 3 illustrates how moderate to strong interaction effects can induce variance heterogeneity of the outcome residual (δ) by genotypic class, resulting in inflation of the type I error rate of the interaction tests between G and Z. This is in agreement with previous work, also showing type I error inflation when misspecifying the main genetic effect [28][29][30]. Variance heterogeneity mostly depends on the strength of the interaction effect and the main effect of the (unmeasured) exposure, although increasing main effects of the tested interacting factor (here G and Z) can also worsen the type I error inflation (Fig 3 and Text A in S1 File). Assuming the magnitude of the interaction effects are similar to the marginal genetic effects observed in cis-

Fig 3. When a true interaction can bias interaction screening.
A quantitative outcome Y is defined as a linear function of a SNP G, an unmeasured exposure E, a measured exposure Z, and an interaction between G by E, with effect γ G , γ E , γ Z , and γ GE , respectively (as defined in Eq 1). All predictors were standardized to have mean 0 and variance 1. In the framework of this analysis, TF mRNA level is considered as an exposure E. We vary γ GE so that the interaction term explains between 0 and 30% of the variance of Y. For simplicity we assume that, when relevant, the main effect of either G, E, or Z explains the same amount of variance as the interaction effect and set ε so that the variance of Y equals 1. Using this model we simulated series of 10,000 replicates, each including 400 individuals and tested for interaction between G and Z using a model not including the unmeasured exposure E (as defined in Eq 2), in the absence of main effect of the predictors (γ G = γ E = γ Z = 0), panel a) or when including a main effect of G (γ G Upper panels show the increase in the residual variance of the outcome δ minus ε (so that models are comparable) stratified by genotypic class while increasing the interaction effect γ GE . Lower panels show the type I error rate α at a p-value threshold of 0.05 for the interaction tests between G and Z derived for each series of 10,000 replicates. eQTL screening (e.g., r 2 >30%, see Fig A in S1 File), this analysis demonstrates that genetic variants found to have significant interactions with multiple factors should be interpreted with caution, as this may indicate that the tested SNP is involved in a strong interaction with some factors, but not necessarily with the factors tested. The key point here is that in a simple model, assuming covariates included in the model are not confounding factors of the predictors tested, heteroscedasticity caused by a strong interaction between the tested variant and an unmeasured factor (or at least a factor not included in the model) can induce spurious interaction effects with other non-interacting factors. Note that heteroscedasticity generated by interaction would only be a limited concern for the test of the marginal genetic effect. Indeed G being involved in an interaction means it is related to the outcome tested, and while power and effect estimation might be impacted, a marginal effect association signal with G would likely represent a true signal.

Correcting for inflation
We then extended our simulations to assess the validity of five approaches in the presence of a strong interaction between the SNP tested and an unmeasured factor. As before, we assumed that the gene expression data have been corrected for technical artifacts but not rank transformed (step 1 and 2 in Fig 1). We considered first standard linear regression without further correction (Eq 2). We then considered the heteroscedasticity consistent (HC) covariancebased approach. A number of alternatives have been proposed to deal with this issue [31], and while some might perform better than others, a complete comparison of these methods is beyond the scope of this study. Therefore we focused on the two most established approaches, the sandwich covariance matrix estimator (HC0) proposed by White [27], and the jackknife HC covariance (HC3), which has been suggested as the most efficient approach to deal with heteroscedasticity in small sample size [32]. Both methods are commonly used in genetic association studies [28,33,34]. As proposed for GWAS, we also considered two other methods that address model misspecification [30], namely dichotomizing the exposure Z, and using a saturated model that includes non-linear main effects of the predictors. For the latter approach we simply included in the model a main effect for Z 2 and for each genotypic class (G = 1 and G = 2).
We compared the robustness of the five approaches across a series of one million simulated replicates for the null model of no interaction between G and Z. We considered 96 different scenarios, increasing sample size from 100 to 5,000 (reflecting sample size in recent analyses [3,4,6]), increasing the coded allele frequency (CAF, analogous to minor allele frequency in this case) from 0.05 to 0.5, assuming normal or a skewed normal distribution for the three continuous variables of the experiment (E, Z and ε), and assuming alternatively presence or absence of a main effect for Z. The magnitude of the G × E interaction effects and the main effects of the predictors (E, G, and Z when relevant) were generated at random for each replicate. As shown in the QQplots from Fig D-K in S1 File, sample size and CAF had the strongest impact on the results, while we observed minor differences when varying the other parameters. Fig 4 presents the average performance of the five tests across the two former parameters (the sample size and CAF). This simulation shows that tests for interaction in small sample sizes (<100) are subject to strong type I error rate inflation for all studied methods. This inflation decreases with increasing sample size, however it can remain substantial for low frequency variants. Moreover, the inflation is non-linear, meaning that genomic control (GC) correction [35] cannot ensure the validity of the test. Overall, HC3 had the best performance, displaying an inflation factor λ close to 1 and only minor or no inflation of low p-values when using a sample size of 1000 or greater and common SNPs (CAF ! 0.3).

Interaction effect screening in ECLIPSE
To illustrate these effects in real data, we conducted a two-step screening approach to identify interaction effects between cis-eQTL SNPs and candidate TFs in a genome-wide expression dataset from 121 subjects in the ECLIPSE Study [36]. Interaction triplets (SNP, TF and target gene) were defined based on an a priori list of TFs and significant SNP-gene pairs from the QQplots over series of 8 million replicates where an outcome Y is simulated as a function of a genetic variant G, an unmeasured exposure E, an interaction between G and E, and in 50% of the replicates a measured exposure Z. In the framework of this analysis, Z and E are considered as measured and unmeasured TF mRNA level, respectively. The validity of five tests is evaluated by comparing the observed -log 10 (pvalue) against the expected -log 10 (p-value) when testing for the null interaction between a G and Z. The tests include a standard linear regression using main and interaction terms only (STD), heteroscedasticity consistent-based tests using effect estimates from STD (HC0 and HC3), linear regression using binary-transformed Z (BIN), and a saturated model including a main effect of Z 2 and each genotype coded as dummy variable (SAT). We considered coded allele frequency (CAF) of 0.05 (first row), 0.3 (middle row) and 0.5 (bottom row), and sample size N of 100, 500, 1,000 and 5,000. We randomly draw E, Z, and ε, the residual of Y from either a normal or a right-skewed normal distribution. For each scenario we derived the genomic inflation factor λ GC . ECLIPSE cis eQTL analysis. We followed the pre-processing procedure of Fig 1, except for part 3 (i.e., no rank-based inverse normal transformation).
Step one consisted of testing all SNPs in the vicinity of a gene for cis-effect on the expression of that gene assuming an additive genetic effect and using a score test as implemented in the GGTools software [37], and to select the most significant cis-eQTL as candidate SNPs for interaction effect testing. Among the 18,834,685 gene-SNP pairs tested, we identified 132,074 SNPs with a q-value for association less than or equal to 0.01. From these results we selected the most significant SNP for each Affymetrix probe set, resulting in a total of 2,982 cis-eQTL SNPs associated with 3,334 target probe sets. As observed in other studies, the variance of the residualized expression phenotypes explained by cis-eQTL SNPs can be much larger than those observed for quantitative phenotype in human GWAS [38]. Fig A in S1 File presents the distribution of r 2 obtained from marginal genetic model for the 3,334 pairs. The average equals 0.45 with a maximum r 2 of 0.87. In comparison, under a null model with the same total number of tests and using the same pvalue threshold, we would expect to observe a distribution of effects to have a mean r 2 of 0.14 and a maximum of approximately 0.25.
In step two, all cis-eQTL SNPs selected at step one were tested for interaction with the expression level of candidate TFs obtained from the publication by Vazquerizas et al [39]. To reduce the multiple testing burden, candidate TFs were tested for interaction with a given gene only if their marginal association with the target probe set was nominally significant (p<0.05). Also, to avoid confounding by a cis-effect of a SNP on two genes in close physical proximity, all TFs within 10Mb of a candidate cis-eQTL SNP were not tested for interaction with that SNP and its target gene. Among 1494 TFs, 1292 TFs represented by 2,896 probesets were available for analysis in the ECLIPSE study. As shown in Fig A in S1 File, the variance of the target gene explained by candidate TFs was high, with an average r 2 of 0.35. Overall, there were 745,943 trios (target probe-set, cis-eQTL, and candidate TF) to be screened for interaction effects. For each of these trios we performed the standard linear regression on non-rank transformed data (std) and inverse-normal rank-transformed data (rkt), and we considered for both approaches the HC3 correction of the effect estimate variance to account for heteroscedasticity (h3 and rkt.h3, respectively).
Fig L in S1 File presents the QQplots observed for each of the four strategies. As in the simulations, we observed non-linear and strong inflation of low p-values, as measured by the genomic control (λ GC equal 1.36, 1.13, 1.22 and 1.10 for std, h3, rkt and rkt.h3, respectively). There were 151, 244, 4, and 75 significant interactions after correction for multiple comparisons (P-value < 1.0x10 -8 ), for std, h3, rkt and rkt.h3, respectively. While a few interactions were significant or near significant across all tests, most showed strong heterogeneity. Table 1 presents the top five interactions from each approach as well as the corresponding p-value and rank. Unsurprisingly, all SNPs from Table 1 had a strong marginal effect on the target gene. For example, rs8109474 explained 63% of the variance of target probe set 218824_at (PNMAL1). The significance of the interaction was also strongly correlated with the strengths of the marginal association between the candidate TF and the target probe set. As shown in Fig 5, λ GC value for the std interaction test increases to almost 2 when focusing on the candidate TFs showing the strongest association with the target gene. While some of the observed λ GC inflation might due to a true enrichment for interaction effects, a substantial part of the association is likely due to statistical artifacts. Conversely, the rkt test, which corrects for the effect of outliers and potentially reduces inflation caused by interaction between SNPs and unmeasured factors ( Table A in S1 File), shows lower inflation (average λ GC = 1.17). In addition, rkt appears to be stable when focusing on TFs with increasing association with the target. Assuming strongly associated TFs are more likely to have biological interaction with cis-eQTL

Fig 5. Distribution of interaction test lambdaGC in ECLIPSE.
We derived the genomic inflation factor (λ GC ) of the standard interaction test using across sub-groups stratified based on P TF.marg , the p-value for association between the target gene and the candidate transcription factors (TFs). Grey bars present the total number of interaction tests falling in each strata. Four approaches were performed: i) no normal rank-transformation of the expression data (std), ii) HC3 correction of the effect estimate variance to account for heteroscedasticity (h3), iii) normal rank-transformation of expression data (rkt), and iv) HC3 correction and normal ranktransformation of expression data (rkt.h3). SNPs, this flat curve raises the potential concern that, in agreement with our simulations, interaction effects on the original scale might be removed by the rkt transformation.

Discussion
Genomic data, and gene-expression data in particular, offer new opportunities to identify gene-gene and gene-environment interactions. In this study we attempted to screen for SNP by TF interaction on gene-expression using data from the ECLIPSE study, and we describe two methodological issues related to the detection and interpretation of statistical genomic interactions. First, interaction effects are scale-dependent and commonly applied pre-processing steps that perform non-linear transformation of expression data can induce or remove statistical interaction, making interpretation of results more challenging. Second, the effect sizes observed in eQTL data are substantially larger than those observed for genetic association with complex human traits and diseases. Assuming interaction effect sizes are similarly large, our simulations show that their presence can induce substantial heteroscedasticity, which itself can impact the robustness of interaction test screening. While heteroscedasticity and scale-dependent effects have been discussed in a broader context, recent screening for interaction effects in gene expression data have not specifically addressed these issues [3,4,6]. We used simulation and real data analysis to explore these issues and evaluate the performance of analytical strategies that avoid non-linear transformation of gene expression (step 3 in Fig 1) while preserving robustness. Our analysis suggests that using the jackknife heteroscedasticity consistent covariance (HC3) correction without applying rank-based inverse normal transformation would be the best approach if sample size is large enough. Rank-based inverse normal transformation, as well as other non-linear transformations of expression data, can have both positive and negative impact on interaction screening. It removes outliers and assures that the marginal distribution of the phenotype is normal, thus enabling better properties of interaction screening under a complete null model (i.e., in the absence of both main and interaction effect). We also observed in simulated data that heteroscedasticity induced by true interaction effects is partly reduced after applying rkt (Table A in S1 File). This partially explains the apparent overall better behavior of the rkt test in the ECLIPSE data analysis. However, previous work showed that rkt is an imperfect solution, because it can impact both type I and type II error rate [21,26]. Moreover, because it alters the outcome scale, rkt can potentially remove the targeted interaction effect and/or induce statistical interaction effects across other SNPs. The question of determining the appropriate scale for interaction testing is a critical issue that remains to be addressed. Rank-based inverse normal transformation or other non-linear transformations that are performed as standard practice for most gene expression analyses fundamentally alter the meaning of statistical interaction in a way that should be explicitly recognized in an interaction analysis. The recent focus on RNA-Seq analysis methods provides an opportunity to revisit this issue. RNA expression is fundamentally heavy-tailed, and many RNA-Seq methods use the negative binomial distribution to model these data, as opposed to the standard quantile normalization and linear modeling approach for microarray data [40]. For the detection of statistical interaction, it is less important to identify a "correct" scale for the analysis, than it is to specify a specific hypothesis for biological interaction and then choose the appropriate scale of the data so that detected statistical interactions will represent biological interactions of interest. From this standpoint, we suggest limiting data transformations and analyzing data on their native scale, though this approach does introduce methodological challenges related to the statistics of non-normal distributions.
Our simulations also demonstrate that an interaction between two factors can induce spurious statistical interaction effects between those factors and other non-interacting factors, leading to the paradoxical situation where the interaction effect screening can be invalid under the scenario of strong interaction. Exploring alternative solutions, we found that interaction effect screening can likely be performed safely among common SNPs (minor allele frequency, MAF ! 5%) in large sample sizes (N ! 5,000) using the HC3 correction, while for smaller sample size (N ! 1000), there will remain uncertainty on the validity of association signal. This result is complementary to previous publications highlighting other challenges in genomic interaction screening. In particular, there has been controversy regarding the validity of previously reported [3] SNP-SNP interactions in eQTL analysis [41,42]. These studies highlighted that testing for cis-cis interaction effects should be interpreted with caution, as an observed statistical interaction may reflect a haplotype effect. While our analysis focused on SNP-by-TF interactions, SNP-SNP interactions would likely face the issues discussed in our study, as the increased type I error rate we observed was driven by the hidden interaction between the SNP tested and an unmeasured quantitative trait, independently of the other tested interacting factor (another SNP in a SNP-SNP interaction screening).
We acknowledge that the proposed strategy does not necessarily represent the optimal solution. To fully address scaling and robustness issues, future work might explore alternatives to the HC3 correction that better model the data (e.g., tests that specifically model residual and predictor distributions) and also assess the impact of common pre-processing practices such as log transformation of raw expression values, quantile normalization [10], adjustment for the principal components of expression [11], and other procedures meant to reduce technical variability [43]. Various combinations of these corrections have been proposed either separately [13] or in an integrated framework [44]. With the exponential increase of genomic data, the validity and performance of existing approaches has been widely discussed for marginal association screening [12,13,[45][46][47], and best practices evolve continuously with new methodological developments and the rise of new technologies. While the proposed approach is an important step toward more robust and interpretable interaction effects screening in genomic data, identifying the optimal analytical strategy will similarly follow an iterative process with additional theoretical work and validation from real data applications.

Non-linear transformation of expression data
We generated an outcome Y as a function of a SNP G, an exposure E and a covariate Z using the following generative model: where γ 0 , γ G , γ E , γ Z and γ GE are the intercept and the effects of G, E, Z and the interaction effect between G and E, respectively; and ε is the residual. We set the minor allele frequency of G at 0.1, and generated E using a normal distribution with mean 0 and variance 1. We considered two scenarios. In the first one, G and E have only main effects, each explaining 40% of the outcome variance, but they do not interact (γ GE = 0). In the second scenario, only E has a main effect (γ E 6 ¼ 0) and (γ G = 0), while G has no main effect but influences Y through its interaction with E. The residual ε is generated following a right-skewed normal distribution and is scaled so that the variance of Y equals 1. For each scenario we plotted Y and rkt(Y) as a function of E, where rkt() is a rank-based inverse normal transformation function. The rkt transformation was performed using the R function rntransform() from the GenABEL R package. We then estimated the effect of E on both Y and rkt(Y) by genotypic class using standard linear regression.

Effect of true interaction on interaction screening
To explore the impact of true interaction effects, we then generated a series of data using Eq (1). For simplicity we standardized all predictors to have mean 0 and variance 1. We vary γ GE , the interaction effect between G and E, so that the variance of the outcome explained by the interaction τ G×E varies from 0 to 30%, where τ G×E = var(γ GE GE) / var(Y). Note that τ G×E is similar to the standard definition of variance explained by interaction effects as all predictors are standardized, however this would not be the case for unstandardized predictors [2]. For simplicity we also assumed that, when relevant, the main effects of either G, E, or Z explain the same amount of variance as the interaction effect. We draw ε, the residual of Y, from a normal distribution with mean 0 and variance scaled so that the unconditional variance of Y equals 1. From this model we generated a series of 10,000 independent replicates of 400 individuals and we conducted a test ofb GZ , the estimated interaction effect between G and Z from the model: We generated multiple series of data from this model in the absence of main effects (γ G = γ E = γ Z = 0), including a main effect of G (γ G 6 ¼ 0), a main effect of E (γ E 6 ¼ 0), or a main effect of G (γ Z 6 ¼ 0). We first evaluated the relationship between τ G×E and δ the residual variance of the outcome from Eq (2), when stratifying by genotypic class. However, to allow for a clearer comparison we also subtracted ε from δ (see Supplementary Note). We then derive for each simulated scenario the type I error rate α of β GZ , the interaction tests between G and Z, at a p-value threshold of 0.05, defined as ðS N s p GZ aÞ=N s , where N s the number of simulations equals 10,000 and p GZ is the p-value of the interaction effect.

Correction for statistical tests
We considered standard linear regression and four alternative approaches that can potentially correct for the non-linear effect of the predictor in the interaction tests. First, we used two heteroscedasticity consistent covariance-based approaches. In brief, we used the interaction effect estimate ðb GZ Þ from Eq (2) and derivedŝ b GZ jHC , the standard deviation of the interaction term, using HC0 and HC3 formulation using the vcov() function from the Sandwich R package. We then derived the Wald test for each updated variance estimates, w HC0 ¼b 2 GZ =ŝ 2 b GZ jHC0 and w HC1 ¼b 2 GZ =ŝ 2 b GZ jHC3 , and their associated p-values. The third correction entails using a dummy variable for Z instead of the raw continuous coding. The dummy variable, Z bin , equals 0 for values of Z smaller than its median and 1 otherwise. The interaction test is then performed by evaluating the term b GZ bin from the following model: The last correction entails using a saturated model where the main effects of the interacting factors (here, G and Z) are modelled using additional terms. Various saturated models can be built. In these analyses we defined a model that includes a main effect of Z 2 and encodes the main effect of the genotype using two dummy variables corresponding to G = 1 and G = 2. We tested b 0 GZ , the interaction effect between Z and G, using its ordinal coding: We generated a series of 1 million replicates using the model from Eq (1) where γ = (γ G , γ E, γ GE, γ Z ) are randomly sampled from a uniform distribution (0,1). Unless otherwise specified, continuous variables (E, Z and ε) were drawn randomly from a normal distribution or rightskewed normal distribution, both with mean 0 and variance 1. The residual ε was further scaled so that the variance of Y explained by the simulated predictors in Eq (1) would vary from 0% to 80%. However, to explore the impact of deviation from normality for each of the three quantitative variables (E, Z and ε) we also performed simulations while using a normal distribution only for all replicates. We also compared scenarios where the either the true interacting exposure E or the tested interacting exposure Z have a main effect against scenarios where they have no main effect. Finally, we varied the sample size from 100 to 5,000 and the frequency of the coded allele for G from 0.05 to 0.5. Overall, our simulations covered 96 scenarios.

The ECLIPSE data
Study participants included 121 COPD cases genotyped as part of the ECLIPSE study [36] using the genome-wide Illumina HumanHap550 BeadChip. Each participant had data on~6.1 million SNPs, either directly genotyped or imputed using the 1000 Genomes EUR reference panel (March 2010) [48]. Details on quality control assessment, filtering of the SNPs and genotype imputation have been described elsewhere [49,50]. Gene expression was measured from whole blood samples on the Affymetrix HGU 133Plus 2.0 chip, and eQTL analysis was performed as previously described [8]. Gene expression data were log-transformed and quantile normalized using the RMA function in the "affy" R package [51] (step1 in Fig 1). We then regressed expression values on age, gender, the first principal components derived from the genotype data on all ECLIPSE participants [52], and the first 13 principal components from gene expression, retaining residuals from this regression for further eQTL analysis (step2 in Fig 1). Genotype data of all samples used in this study are available in dbGap (phs001252.v1. p1, in process). ECLIPSE expression data has been previously submitted to GEO as part of another project (229 samples) with GSE76705. Note that the ECLIPSE dbGaP submission (phs001252.v1.p1, in process) will also contain links to the GEO expression data.

S1 File. Text A, Figures A-L, and Table A.
(PDF)