Screening for interaction effects in gene expression data

doi:10.1371/journal.pone.0173847

Fig 1.

Gene expression data pre-processing pipeline.

Standard pre-processing methods applied to gene expression data prior to expression quantitative trait locus analysis. Alternative strategies are also used: for example, step 2 is sometimes skipped and confounding factors (e.g. batch) are instead included as covariates in the tested model. Others have applied step 3 before step 2.


Fig 2.

Effect of non-linear transformation on interaction effects.

We defined an outcome Y as a function of a single nucleotide polymorphism G with a minor allele frequency of 0.1, an exposure E normally distributed with mean 5 and variance 1, and a residual term ε drawn from a right-skewed normal distribution. In the framework of this analysis, TF mRNA level is considered as an exposure E. We generated two datasets of 10,000 individuals, one for each scenario. In a), G and E have only main effects, each explaining 20% of the variance of Y. In b), the main effects of G and E each explain 10% of the variance of Y, and their interaction explains a further 20%. Upper panels show Y as a function of E by genotypic class, with the trend slope from a standard linear regression. Lower panels show the same data after a rank-normal transformation (rkt) of Y. The interaction effect (observed as differences in slope by genotypic class) appears or disappears depending on the transformation applied to Y. P-values for interaction are indicated in red.
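
The rank-normal transformation (rkt) mentioned in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `rank_normal_transform` and the rank offset of 0.5 are assumptions, since the caption does not state which variant of the transformation was used.

```python
from statistics import NormalDist


def rank_normal_transform(values, offset=0.5):
    """Replace each value by the standard normal quantile of its rank.

    Ties receive their average rank. The offset of 0.5 is one common
    convention and is an assumption here, not taken from the paper.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current run of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - offset) / n) for r in ranks]
```

Because the transformation depends only on ranks, it is a non-linear rescaling of Y, which is why slopes by genotypic class can change after it is applied.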


Fig 3.

When a true interaction can bias interaction screening.

A quantitative outcome Y is defined as a linear function of a SNP G, an unmeasured exposure E, a measured exposure Z, and an interaction between G and E, with effects γ_G, γ_E, γ_Z, and γ_GE, respectively (as defined in Eq 1). All predictors were standardized to have mean 0 and variance 1. In the framework of this analysis, TF mRNA level is considered as an exposure E. We vary γ_GE so that the interaction term explains between 0 and 30% of the variance of Y. For simplicity we assume that, when relevant, the main effect of either G, E, or Z explains the same amount of variance as the interaction effect, and we set the variance of ε so that the variance of Y equals 1. Using this model we simulated series of 10,000 replicates, each including 400 individuals, and tested for interaction between G and Z using a model not including the unmeasured exposure E (as defined in Eq 2), in the absence of main effects of the predictors (γ_G = γ_E = γ_Z = 0, panel a), or when including a main effect of G (γ_G ≠ 0, panel b), a main effect of E (γ_E ≠ 0, panel c), or a main effect of Z (γ_Z ≠ 0, panel d). Upper panels show the increase in the residual variance of the outcome, δ minus ε (so that models are comparable), stratified by genotypic class as the interaction effect γ_GE increases. Lower panels show the type I error rate α at a p-value threshold of 0.05 for the interaction tests between G and Z, derived for each series of 10,000 replicates.
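
The heteroscedasticity shown in the upper panels can be reproduced with a small simulation. The sketch below covers only the scenario without main effects (panel a) and assumes that G, E, and ε are independent and that E and ε are normal. The function name, sample size, allele frequency, and seed are illustrative choices, not values from the paper.

```python
import random
from statistics import pvariance


def simulate_variance_by_genotype(n=20000, maf=0.3, var_explained=0.2, seed=1):
    """Variance of Y within each genotype class when Y = gamma_GE * G * E + eps.

    G is standardized, E ~ N(0, 1) is unmeasured, and eps is scaled so that
    Var(Y) = 1. Independence of G, E, and eps is an assumption of this sketch.
    """
    rng = random.Random(seed)
    gamma_ge = var_explained ** 0.5        # Var(G * E) = 1 for standardized, independent G and E
    sd_eps = (1.0 - var_explained) ** 0.5
    mean_g = 2 * maf
    sd_g = (2 * maf * (1 - maf)) ** 0.5
    by_class = {0: [], 1: [], 2: []}
    for _ in range(n):
        count = (rng.random() < maf) + (rng.random() < maf)  # number of coded alleles
        g = (count - mean_g) / sd_g                          # standardized genotype
        e = rng.gauss(0.0, 1.0)
        y = gamma_ge * g * e + rng.gauss(0.0, sd_eps)
        by_class[count].append(y)
    return {c: pvariance(ys) for c, ys in by_class.items() if len(ys) > 1}


variance_by_class = simulate_variance_by_genotype()
```

Under these assumptions the variance of Y given G = g is γ_GE² g² plus the variance of ε, so once E is left out of the model the residual variance differs across genotype classes. This is the setting in which a standard interaction test between G and Z can lose validity.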


Fig 4.

Robustness comparison.

QQ plots over series of 8 million replicates where an outcome Y is simulated as a function of a genetic variant G, an unmeasured exposure E, an interaction between G and E, and, in 50% of the replicates, a measured exposure Z. In the framework of this analysis, Z and E are considered as measured and unmeasured TF mRNA level, respectively. The validity of five tests is evaluated by comparing the observed -log₁₀ (p-value) against the expected -log₁₀ (p-value) when testing for the null interaction between G and Z. The tests include a standard linear regression using main and interaction terms only (STD), heteroscedasticity-consistent tests using effect estimates from STD (HC0 and HC3), linear regression using binary-transformed Z (BIN), and a saturated model including a main effect of Z² and each genotype coded as a dummy variable (SAT). We considered coded allele frequencies (CAF) of 0.05 (top row), 0.3 (middle row) and 0.5 (bottom row), and sample sizes N of 100, 500, 1,000 and 5,000. We randomly drew E, Z, and ε (the residual of Y) from either a normal or a right-skewed normal distribution. For each scenario we derived the genomic inflation factor λ_GC.
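
As an illustration of the HC3 test named above, the sketch below fits the standard model Y ~ 1 + G + Z + G×Z by ordinary least squares and replaces the usual variance of the interaction estimate with the HC3 sandwich estimate, in which each squared residual is divided by (1 − leverage)². It uses only the Python standard library; the function names, the normal approximation for the p-value, and the demo data are assumptions of this sketch, not details from the paper.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def hc3_interaction_test(y, g, z):
    """Return (estimate, HC3 standard error, p-value) for the G x Z term."""
    rows = [[1.0, gi, zi, gi * zi] for gi, zi in zip(g, z)]
    k = 4
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(k)] for a in range(k)]
    xtx_inv = invert(xtx)
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(k)]
    beta = [sum(xtx_inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    var_hc3 = 0.0
    for r, yi in zip(rows, y):
        resid = yi - sum(b * x for b, x in zip(beta, r))
        xinv = [sum(xtx_inv[a][b] * r[b] for b in range(k)) for a in range(k)]
        leverage = sum(r[a] * xinv[a] for a in range(k))
        # HC3 weight: squared residual inflated by (1 - leverage)^2;
        # xinv[3] picks out the contribution to the interaction coefficient.
        var_hc3 += (resid / (1.0 - leverage)) ** 2 * xinv[3] ** 2
    se = math.sqrt(var_hc3)
    z_stat = beta[3] / se
    p_value = math.erfc(abs(z_stat) / math.sqrt(2.0))  # two-sided, normal approximation
    return beta[3], se, p_value


# Hypothetical demo data: a true interaction of 1.0 and noise whose spread grows with G.
rng = random.Random(7)
g_demo = [float((rng.random() < 0.3) + (rng.random() < 0.3)) for _ in range(300)]
z_demo = [rng.gauss(0.0, 1.0) for _ in range(300)]
y_demo = [0.5 * gi + 0.3 * zi + 1.0 * gi * zi + rng.gauss(0.0, 0.3) * (1.0 + gi)
          for gi, zi in zip(g_demo, z_demo)]
beta_gz, se_gz, p_gz = hc3_interaction_test(y_demo, g_demo, z_demo)
```

The point estimate is the same as in the STD model; only its variance changes. HC0 would use the squared residuals without the leverage correction.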


Fig 5.

Distribution of interaction test λ_GC in ECLIPSE.

We derived the genomic inflation factor (λ_GC) of the standard interaction test across sub-groups stratified by P_TF.marg, the p-value for association between the target gene and the candidate transcription factors (TFs). Grey bars show the total number of interaction tests falling in each stratum. Four approaches were applied: i) no rank-normal transformation of the expression data (std), ii) HC3 correction of the effect estimate variance to account for heteroscedasticity (h3), iii) rank-normal transformation of the expression data (rkt), and iv) HC3 correction and rank-normal transformation of the expression data (rkt.h3).
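
The genomic inflation factor is conventionally computed as the median of the observed 1-degree-of-freedom chi-square statistics divided by the median expected under the null. A minimal sketch of that calculation, including a per-stratum version, is given below. The function names and the stratum boundaries passed as `upper_bounds` are hypothetical; the caption does not give the P_TF.marg cut-offs actually used.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """lambda_GC from two-sided p-values; each p must satisfy 0 < p <= 1."""
    chi2 = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / _CHI2_1DF_MEDIAN


def lambda_gc_by_stratum(p_interaction, p_marginal, upper_bounds):
    """lambda_GC of interaction p-values within strata of the marginal p-value.

    Each test is assigned to the smallest bound that its marginal p-value
    does not exceed; empty strata are omitted from the result.
    """
    bounds = sorted(upper_bounds)
    strata = {bound: [] for bound in bounds}
    for p_int, p_marg in zip(p_interaction, p_marginal):
        for bound in bounds:
            if p_marg <= bound:
                strata[bound].append(p_int)
                break
    return {b: genomic_inflation_factor(ps) for b, ps in strata.items() if ps}
```

A value close to 1 indicates well-calibrated p-values, while values above 1 indicate inflation.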


Table 1.

Top 5 interaction signals for four different analytical strategies.
