A robust and adaptive framework for interaction testing in quantitative traits between multiple genetic loci and exposure variables

The identification and understanding of gene-environment interactions can provide insights into the pathways and mechanisms underlying complex diseases. However, testing for gene-environment interaction remains a challenge since a.) statistical power is often limited and b.) modeling of environmental effects is nontrivial and such model misspecifications can lead to false positive interaction findings. To address the lack of statistical power, recent methods aim to identify interactions on an aggregated level using, for example, polygenic risk scores. While this strategy can increase the power to detect interactions, identifying contributing genes and pathways is difficult based on these relatively global results. Here, we propose RITSS (Robust Interaction Testing using Sample Splitting), a gene-environment interaction testing framework for quantitative traits that is based on sample splitting and robust test statistics. RITSS can incorporate sets of genetic variants and/or multiple environmental factors. Based on the user’s choice of statistical/machine learning approaches, a screening step selects and combines potential interactions into scores with improved interpretability. In the testing step, the application of robust statistics minimizes the susceptibility to main effect misspecifications. Using extensive simulation studies, we demonstrate that RITSS controls the type 1 error rate in a wide range of scenarios, and we show how the screening strategy influences statistical power. In an application to lung function phenotypes and human height in the UK Biobank, RITSS identified highly significant interactions based on subcomponents of genetic risk scores. While the contributing single variant interaction signals are weak, our results indicate interaction patterns that result in strong aggregated effects, providing potential insights into underlying gene-environment interaction mechanisms.

For step 2.), we applied generalized additive models (GAMs) [4] using the bam function in the mgcv R package [5][6][7]. For these models, we used cubic regression splines with = 3 for each factor separately, and default options otherwise. For step 1.) and 3.), we used a LASSO-based linear model based on and , including interaction terms; the tuning parameter was chosen using cross validation. In practice, 2-3 iterations of steps 2.) + 3.) are often sufficient.

RITSS main effects implementation in the simulation studies and the UK Biobank application
The main effect models in RITSS were implemented using the bam function in the mgcv R package [5][6][7]. The genetic main effects were realized as non-smooth standard covariates (additive genetic model), assuming no interactions with . We implemented an option to include these interactions. The interaction scores were fitted as standard covariates. The environmental main effect ( , ) was realized using cubic regression splines with = 3 for each factor separately, otherwise default settings.

Screening strategies
Screening strategies 1: interaction between single environmental factor and subcomponent of genetic risk score Based on these objects, we split 1 1 randomly in two equally sized parts and perform approximate best subset selection [8] in each part with different subset sizes, where the corresponding other part is used as testing data. The best subset here corresponds to the best subset of variants, and we do not utilize the effect estimates per variant inferred in the best subset regression. Instead, we sum over the corresponding variants, describing the genetic risk score multiplied by the environmental factor. The size of the subsets is increased between 10 and , in steps of size 5. We select the best subsets in both parts of the data in terms of the association with in the test data and check the overlap between these subsets to create two different scores. Denote by 1 the set of overlapping variants (contained in both best subsets), and by 2 the set of nonoverlapping variants that were only in one best subset. The first score is then defined by 1 = ∑ 1 , and the second by 2 = ∑ 2 .

Screening strategies 2: aggregated single environmental factor interactions based on single variant testing
Based on the model = 0 + 1 + 2 + 0 + + for each genetic variant in step b.), we select all variants where the FDR q-value for is below 0.05 in the first score, and all variants where the FDR q-value for is between 0.05 and 0.10 in the second score. The corresponding estimated interaction effects ̂ are used to construct the corresponding interaction score. If there are no variants with FDR q-value below 0.1, we keep the variants with the smallest interaction p-value and construct only one score. If there is no variant with FDR q-value below 0.05, but below 0.1, we construct only one score with all variants with FDR q-value below 0.1.

Implementation of GAMsv
The alternative approach GAMsv was implemented using the bam function in the mgcv R package [5][6][7]. The genetic main effect of a single variant was modeled as a non-smooth standard covariate (additive genetic model), the same applies to the standard interaction term . The environmental main effect ( , ) was realized using cubic regression splines with = 3 for each factor separately; otherwise default settings were applied.

Simulation details
For all simulations, we set = 30,000, = 100, = 5, and = 2. The residual error is simulated by = * (1 + | | * 1 ), where ~(0,1) independent in the case of normal errors. In the scenario of non-normal errors, is sampled from the meancentered and standardized lung function ratio in the UK Biobank [9]. The parameter controls the presence of heteroscedastic errors. The phenotype is computed by: The effects 0 , , , , , and are generated as follows:  The SELECT:yes/no scenarios were implemented as follows: • SELECT:no: all variants are included.
• SELECT:yes, if the number of variants with association p-value < 5 * 10 −8 (based on all samples) is between 30% and 50% of , keep only these variants. If not: keep the 2 variants with the smallest association p-values.

Power simulations
For the power simulations, we used scenario 1 from the type 1 error simulations with SELECT:no and added a gene-environment interaction term to the model. The full model is given by    Table  B:  Type  1  error  simulation  results  based  on  1,000 replicates. / denotes the configuration of and the splitting fraction . RITSS1 KS test and RITSS2 KS test report the p-value of the Kolmogorov-Smirnov test, Bonferroni corrected for 40 tests (5 x 2 x 4). This test compared the p-values of RITSS1 and RITSS2 with a standard uniform distribution. The last columns report the number of significant (Bonferroni) interaction p-values, that is, the number of p-values with <

Study population
Our data analysis utilized participants from the UK Biobank [9]. We selected only participants of European ancestry, where ancestry was derived based on a combination of self-reported ethnicity and k-means clustering of principal components of genetic ancestry, as previously described [10]. Besides standard filters for the UK Biobank, quality control included the exclusion of related pairs, and excluding participants whose spirometry data did not meet quality control standards [10,11] or that had missing phenotype/covariate data (see below). Overall, we kept 254,033 participants for the analysis.

Phenotype and covariate data
We incorporated lung function data as measured by forced expiratory volume in 1 second (FEV1), forced vital capacity (FVC), and the ratio FEV1/FVC. We also extracted age, sex, standing height, smoking exposure variables, genotyping array, and the first ten principal components of genetic ancestry. Smoking exposure was based on self-reports and included the variables pack-years of smoking (P-Y-S) and ever-versus never-smoking status (E-S). 'Ever-smokers' included individuals reporting current smoking, smoking most days, smoking occasionally, or former smoking. 'Never smokers' included those who smoked less than 100 cigarettes in their lifetime. Standing height is referred to as height in the main text.

Genetic data
Genotyping and imputation for the UK Biobank were performed as described in the corresponding publications [9,10]; this genetic data was directly available for our analyses. For each of the four phenotypes FEV1, FVC, FEV1/FVC, and height, we downloaded all reported genetic associations from the GWAS catalog [12] (February-August, 2021) (EFO_0004314, EFO_0004312, EFO_0004713, and EFO_0004339, respectively). We extracted the corresponding genetic variants with a minor allele frequency above 1% (estimated in the analysis dataset) and performed LD pruning (indep-pairwise command with parameters 500, 50, and 0.2) using PLINK2 (version v2.00a2.3LM) [13] to exclude genetic variants that are in strong LD. We also excluded multi-allelic variants. Our analyses are based on expected minor allele count information, as computed by PLINK2. The final numbers of variants for the analysis of each respective phenotype are described in Table 1 of the main text.