Case-control studies of gene-environment interactions. When a case might not be the case

Case-control Genome-Wide Association Studies (GWAS) provide a rich resource for studying the genetic architecture of complex diseases. A key goal is to elucidate how genetic effects vary with the environment, which is traditionally defined as Gene-Environment interaction (GxE). An overlooked complication is that multiple, distinct pathophysiologic mechanisms may lead to the same clinical diagnosis, and these mechanisms often have distinct genetic bases. In this paper, we first show that using the clinically diagnosed status can lead to severely biased estimates of GxE interactions when the frequency of the pathologic diagnosis of interest, as compared to other diagnoses, depends on the environment. We then propose a pseudo-likelihood solution to correct the bias. Finally, we demonstrate our method in extensive simulations and in a GWAS of Alzheimer’s disease.


Introduction
We are interested in using data from a case-control Genome-Wide Association Study (GWAS) to estimate how an "environmental variable" modifies the effect of a genetic variant on a specific, pathologically defined disease state. The complication is that in many GWAS the cases are a heterogeneous group, in which multiple distinct pathologically defined disease states have led to a common set of symptoms and a shared clinical diagnosis. In these scenarios, a genetic variant will appear to interact with the environmental variable if the genetic variant affects the pathologically defined disease state of interest and the environmental variable is related to the proportion of cases with that disease state.
The issue of heterogeneity among cases is, perhaps, most pronounced in neurologic and psychiatric disorders, where the clinically defined status is based primarily on descriptive criteria and is typically assigned in the absence of biomarker measurements, imaging data, and biopsies. Our specific motivating study is a GWAS of late-onset Alzheimer's disease (AD), a neurodegenerative disorder that is clinically characterized by progressive mental decline. Here, we are interested in identifying genetic variants specifically associated with a high abundance of amyloid deposits and neurofibrillary tangles in the brain, which we refer to as "histopathologically defined AD." [1] Specifically, we are interested in whether carrying the ApoE ε4 variant, which in this study is considered the "environmental variable", modifies the effect of SNPs residing in Toll-Like Receptors (TLR) and the Receptor for advanced glycation end products (RAGE) on histopathologically defined AD. Importantly, ApoE ε4 status is likely to be associated with the proportion of the GWAS cases who have histopathologically defined AD. Recent biomarker studies of AD [2] reported that 36% of ApoE ε4 non-carriers and 6% of ApoE ε4 carriers clinically diagnosed with AD do not have evidence of amyloid deposition. We provide a more detailed description of ApoE ε4, other risk factors for AD, and the heterogeneity of the disease in the Discussion section.
We aim to test for an association between single nucleotide polymorphisms (SNPs) residing in Toll-Like Receptors (TLR) and the true AD diagnosis, i.e. our goal is to identify the genetic variants that might have led to amyloid plaques with associated cognitive decline. TLRs play a key role in the innate immune response to invading pathogens and are also important for triggering adaptive immune responses. Dysregulation of human toll-like receptor function has been shown in aging [3]. With respect to the etiology of AD specifically, TLRs act through modification of the inflammatory state of microglia/macrophages [1]. The Receptor for advanced glycation end products (RAGE) has been identified as a receptor for amyloid-beta peptide [4].
There is an extensive literature on how estimates of the main genetic effect can be biased when disease status is misclassified, i.e. when the clinical and pathologic diagnoses do not correspond [5]. We extend the literature by investigating the impact of misdiagnosis on estimates of the Gene-Environment interaction (GxE). In case-control studies, the effects of covariates have traditionally been assessed using logistic regression analysis [6]. Recently, however, Chatterjee and Carroll [7] noticed and proved that the assumptions of Hardy-Weinberg Equilibrium and Gene x Environment independence can be leveraged in appropriate retrospective analyses to gain statistical efficiency. We adopt the principles derived by Chatterjee and Carroll [7] and develop a pseudo-likelihood model for settings in which a case defined by the clinical diagnosis might not be a case in terms of the true diagnosis defined pathophysiologically.
Our paper is organized as follows. First, in the Materials and Methods section we present the setting, notation, and proposed pseudo-likelihood approach. Next, the Simulation Experiments section describes the simulation experiments conducted to compare the performance of the proposed method with that of standard logistic regression using the clinically defined disease status. In the same section, we apply our method to the motivating study of AD. The Discussion section concludes the paper.

Materials and methods
Let G be the genotype, e.g. SNPs measured at multiple locations. Let X be the environmental variables that interact with G, and let Z be other environmental variables. We assume that the genotype is independent of all environmental variables and that the genotypes follow Hardy-Weinberg Equilibrium (HWE): $G \sim Q(g, \theta)$. If $\theta$ is the frequency of the minor allele a when the major allele is A, then the HWE model [8] according to the number of minor alleles is

$$\mathrm{pr}(G = 0) = (1 - \theta)^2, \qquad \mathrm{pr}(G = 1) = 2\theta(1 - \theta), \qquad \mathrm{pr}(G = 2) = \theta^2.$$

[…] Repository for Alzheimer's Disease (NCRAD), which receives government support under a cooperative agreement grant (U24 AG21886) awarded by the National Institute on Aging (NIA), were used in this study. We thank contributors who collected samples used in this study, as well as patients and their families, whose help and participation made this work possible. Data for this study were prepared, archived, and distributed by the National Institute on Aging Alzheimer's Disease […]

Let $D^{CL} \in \{0, 1\}$ be the observed clinical disease status, defined based on a set of symptoms. Suppose that the same set of symptoms can be caused by two distinct pathophysiologic mechanisms. Let D be the true disease status defined based on the underlying pathology, where $D = 1$ indicates the disease of interest, while $D = 1^*$ is the nuisance disease. For ethical and/or budgetary reasons it might not be possible to measure the underlying pathology, hence D is latent. Instead, an evaluation is performed on a subset of patients or in an external reliability study. We define $\tau(X) = \mathrm{pr}(D = 1 \mid D^{CL} = 1, X)$ to be the frequency of the true diagnosis of interest within the clinically diagnosed set, which varies with the environment X. We let the probabilities of the clinical and true diagnoses in the population be $p_{d_{cl}} = \mathrm{pr}(D^{CL} = d_{cl})$ and $\pi_d = \mathrm{pr}(D = d)$, respectively.
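As a small illustration of the Hardy-Weinberg genotype model introduced above, the following sketch computes the genotype probabilities by number of minor alleles. The function name `hwe_genotype_probs` and the example frequency 0.10 are illustrative choices, not part of the original analysis code.

```python
def hwe_genotype_probs(theta):
    """Genotype probabilities under Hardy-Weinberg Equilibrium.

    theta is the minor-allele frequency; the returned list holds
    pr(G = 0), pr(G = 1), pr(G = 2), where G counts minor alleles.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    return [(1.0 - theta) ** 2, 2.0 * theta * (1.0 - theta), theta ** 2]


# Illustrative minor-allele frequency (hypothetical value).
probs = hwe_genotype_probs(0.10)
print(probs)
```

The three probabilities always sum to one, and the carrier probability is 1 minus the first entry.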
The clinical and true diagnoses are related by […], which indicates that the probabilities of the clinical diagnosis are weighted sums of the frequencies of the true diagnoses. If pr( […]; hence, if there is no relationship between (X,G) and D, neither is there one between (X,G) and $D^{CL}$.
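To make the consequence of this weighting concrete, the sketch below computes the odds ratio for a binary genotype that would be observed when clinically diagnosed cases are a mixture of the disease of interest and a nuisance disease. It is a simplified calculation under stated assumptions, not the authors' derivation: the genotype is a binary carrier indicator, nuisance-disease cases are assumed to have the same carrier frequency as controls, the control carrier frequency `p0` is assumed equal in both ApoE ε4 strata, and the true model has no GxE interaction. The mixing proportions are taken from the text, $\tau(\varepsilon4-) = 1 - 0.36$ and $\tau(\varepsilon4+) = 1 - 0.06$; the value of `beta_g` is hypothetical.

```python
import math


def apparent_log_or(beta_g, tau, p0):
    """Log odds ratio of carrier status, clinical cases vs. controls.

    beta_g : true log odds ratio for the disease of interest (D = 1)
    tau    : pr(D = 1 | D_CL = 1, X), share of true cases among clinical cases
    p0     : carrier frequency among controls (assumed shared by nuisance cases)
    """
    odds_controls = p0 / (1.0 - p0)
    odds_true_cases = odds_controls * math.exp(beta_g)
    p_true_cases = odds_true_cases / (1.0 + odds_true_cases)
    # Clinical cases are a mixture of true cases and nuisance-disease cases.
    p_mix = tau * p_true_cases + (1.0 - tau) * p0
    return math.log(p_mix / (1.0 - p_mix)) - math.log(odds_controls)


beta_g = math.log(2.0)                          # hypothetical true genetic effect
tau = {"e4-": 1.0 - 0.36, "e4+": 1.0 - 0.06}    # from the biomarker study cited in the text
log_or = {k: apparent_log_or(beta_g, t, p0=0.10) for k, t in tau.items()}

# The true interaction is zero, yet the stratum-specific estimates differ.
apparent_interaction = log_or["e4+"] - log_or["e4-"]
print(log_or, apparent_interaction)
```

Under these assumptions both stratum-specific log odds ratios are attenuated toward zero, by different amounts, so their difference mimics an interaction even though none exists for the disease of interest.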
We first consider a binary setting where the risk parameters are defined in terms of $D = 1$ vs. $D = 1^*$ and $D = 0$ combined. Then the risk model is defined in terms of coefficients $B = (\beta_0, \beta_G, \beta_X, \beta_Z, \beta_{G \times X})$ by […] (Eq 1). In the second setting that we consider, the risk model is defined separately for $D = 1$ vs. $D = 0$ in terms of $B = (\beta_0, \beta_G, \beta_X, \beta_Z, \beta_{G \times X})$ and for $D = 1^*$ vs. […] (Eq 2).
In Eq (2), $B$ and $B^*$ might share coefficients, e.g. […]
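The two-disease risk model can be sketched as follows, assuming (this is an assumption, since the displayed equations are not available here) that it takes the usual polytomous logistic form with $D = 0$ as the reference category. The function name, the dictionary keys, and all coefficient values are hypothetical.

```python
import math


def polytomous_risk(g, x, z, beta, beta_star):
    """Return (pr(D=0), pr(D=1), pr(D=1*)) given covariates.

    Assumes a polytomous logistic model with D = 0 as reference:
    log{pr(D=1)/pr(D=0)}  is linear in (1, G, X, Z, G*X) with coefficients beta,
    log{pr(D=1*)/pr(D=0)} is linear in the same terms with coefficients beta_star.
    """
    def linear_predictor(b):
        return b["0"] + b["G"] * g + b["X"] * x + b["Z"] * z + b["GX"] * g * x

    e1 = math.exp(linear_predictor(beta))
    e2 = math.exp(linear_predictor(beta_star))
    denom = 1.0 + e1 + e2
    return 1.0 / denom, e1 / denom, e2 / denom


# Hypothetical coefficients; the nuisance disease has no genetic effect here
# and shares the Z coefficient with the disease of interest.
beta = {"0": -1.0, "G": 0.4, "X": 1.2, "Z": 0.3, "GX": 0.5}
beta_star = {"0": -2.0, "G": 0.0, "X": 0.2, "Z": 0.3, "GX": 0.0}

p0, p1, p1_star = polytomous_risk(g=1, x=1, z=0, beta=beta, beta_star=beta_star)
print(p0, p1, p1_star)
```

Setting an entry of `beta_star` equal to the corresponding entry of `beta` is one way to express shared coefficients between the two sets.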
The observed data are collected using a case-control design, where genetic and environmental variables are measured after the disease status is ascertained. However, the data will be analyzed as a random sample. To facilitate this analysis, we let $\delta = 1$ be an indicator of selection into the study and consider imaginary Bernoulli sampling with $\mathrm{pr}(\delta = 1 \mid D$ […] and for model (2) […]. In addition, we let $\gamma_{d_{cl} \mid d}(X) = \mathrm{pr}(D^{CL} = d_{cl} \mid D = d, X)$. Consider the probability $\mathrm{pr}(D^{CL}, G \mid X, Z, \delta = 1)$ and define a function $L(d_{cl}, g, x, z; \Omega)$ as follows. […]
The pseudo-likelihood can be used in place of the likelihood function based on arguments provided in the Appendix. Define $\Psi(d_{cl}, g, x, z; \Omega)$ to be the derivative of $\log\{L(d_{cl}, g, x, z; \Omega)\}$ with respect to $\Omega$ and […], where all expectations are taken with respect to the actual retrospective sampling scheme. Derivations shown in the Appendix demonstrate that under suitable regularity conditions there is a consistent sequence of solutions to $L_n(\Omega) = 0$ with the following property […]

Remark 1: The intercept parameter $\kappa_{d_{cl}}$ is a function of the probability of disease in the population. Hence, if the probability of clinical diagnosis in the population is known, or a good bound can be specified, this information can be used while estimating parameters. This cannot be done in the usual logistic regression setting.

Simulation experiments
The goal of the simulation study is to examine potential differences in the estimated effects of the genetic and environmental variables in relation to 1) the observed clinical diagnosis, using the usual logistic regression model (uLR) and the pseudo-likelihood model (pMLE) [7]; and 2) the true disease status, using our pseudo-likelihood approach (pMLE-DX), which takes into account that only a proportion of the clinically diagnosed cases have the true disease. In pMLE-DX, parameters are estimated based on Eq (4). Estimates are compared by their Bias and Root Mean Squared Error (RMSE). Simulations are performed using MATLAB version R2017a.
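The two comparison metrics can be computed from a set of replicate estimates as in the following minimal sketch. The function name and the example numbers are hypothetical; bias is taken as the mean estimate minus the true value, and RMSE as the square root of the mean squared deviation from the true value.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Return (bias, rmse) of replicate estimates around a known true value."""
    n = len(estimates)
    if n == 0:
        raise ValueError("at least one estimate is required")
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse


# Hypothetical estimates of one coefficient from three simulated datasets.
bias, rmse = bias_and_rmse([0.9, 1.1, 1.3], true_value=1.0)
print(bias, rmse)
```

Because the squared RMSE equals the squared bias plus the variance of the estimates, a method can have a small bias and still a large RMSE when its estimates are highly variable.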
In each setting we simulate 500 datasets with $n_0 = n_1 \in \{1000, 3000, 5000, 10000, 50000\}$. We let the genotype (G) be a Bernoulli random variable with frequency 0.10 to mimic a SNP and allow its effect to follow a recessive or dominant model. We set our other parameters to be similar to the values observed in our GWAS of AD. The binary variable $X \in \{\varepsilon4+, \varepsilon4-\}$ represents the ApoE ε4 status, i.e. the presence or absence of the ε4 allele, which occurs in approximately 14% of the population.
The proportion of the nuisance disease within the clinical diagnosis is defined as $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.36$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0.06$. The clinical diagnosis of late-onset AD is defined for ages 65 and older. We simulated age ($Z_1$) to be Bernoulli with frequency 0.50, e.g. corresponding to a median split. Sex ($Z_2$) is Bernoulli with frequency 0.52 to reflect what we observed in the motivating data example of AD.
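The covariate distributions and the misdiagnosis proportions stated above can be sketched in code as follows. This is only a partial illustration in Python, not the MATLAB code used for the simulations: it draws the binary covariates and, for a clinically diagnosed case, the latent pathologic status given ApoE ε4 status. It does not generate disease status from the risk model. The function names are hypothetical.

```python
import random

# pr(D = 1* | D_CL = 1, X): nuisance-disease share among clinical cases,
# keyed by ApoE e4 status (0 = non-carrier, 1 = carrier), as stated in the text.
P_NUISANCE = {0: 0.36, 1: 0.06}


def draw_covariates(rng):
    """Draw independent binary covariates with the frequencies from the text."""
    return {
        "G": int(rng.random() < 0.10),   # genotype indicator
        "X": int(rng.random() < 0.14),   # ApoE e4 status
        "Z1": int(rng.random() < 0.50),  # age, median split
        "Z2": int(rng.random() < 0.52),  # sex
    }


def draw_latent_status(x, rng):
    """Latent pathologic status of a clinically diagnosed case: '1*' or '1'."""
    return "1*" if rng.random() < P_NUISANCE[x] else "1"


rng = random.Random(2017)  # fixed seed so the sketch is reproducible
population = [draw_covariates(rng) for _ in range(20000)]
carrier_cases = [draw_latent_status(1, rng) for _ in range(20000)]
noncarrier_cases = [draw_latent_status(0, rng) for _ in range(20000)]
```

In a full simulation, disease status would first be generated from the risk model and cases and controls then sampled; only the clinically diagnosed cases would receive a latent status in this way.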
Setting A. We first examine a setting in which the nuisance disease and controls are equivalent, in that the risk parameters are defined for the disease of interest vs. the combination of controls and nuisance disease, as in Eq (1). The risk coefficients […]. In this setting, the frequency of the true disease status is $\mathrm{pr}(D = 1)$ = 46%, $\mathrm{pr}(D = 1 \mid \varepsilon4-)$ = 40%, $\mathrm{pr}(D = 1 \mid \varepsilon4+)$ = 82%. Table 1 presents properties of the risk parameter estimates in datasets with $n_0 = n_1 = 3000$. S1 Table additionally shows studies with $n_0 = n_1 \in \{1000, 5000, 10000, 50000\}$.

When the presence of the nuisance disease is ignored (uLR, pMLE), $\hat\beta_{\varepsilon4}$ and $\hat\beta_{G \times \varepsilon4}$ are biased with elevated RMSE. For example, in a study with $n_0 = n_1 = 3000$, the bias in $\hat\beta_{\varepsilon4}$ is -0.31 in uLR and pMLE, while the bias is reduced to 0.005 by pMLE-DX. RMSE is 0.33 in uLR and pMLE, while it is reduced to 0.12 by pMLE-DX. Similarly, the bias in $\hat\beta_{G \times \varepsilon4}$ is 0.56 in uLR and pMLE, while pMLE-DX reduces the bias by more than half. The RMSE of $\hat\beta_{G \times \varepsilon4}$ is 2.5x larger when the presence of the nuisance disease is ignored. Notably, estimates of $\beta_{Z_1}$ and $\beta_{Z_2}$ are biased in uLR and pMLE. When the sample size increased, the uLR bias in $\hat\beta_{G \times \varepsilon4}$ decreased, e.g. the bias is 0.08 in a study with $n_0 = n_1 = 10000$, while the bias in $\hat\beta_{\varepsilon4}$ persisted. Across all sample sizes, $\hat\beta_G$ is biased by approximately -0.13, whereas accounting for the nuisance disease nearly eliminated the bias, e.g. to -0.01 in a study with $n_0 = n_1 = 1000$.

We next examine whether the presence of the nuisance disease could lead us to erroneously conclude that there is a significant $\hat\beta_{G \times \varepsilon4}$ when $\beta_{G \times \varepsilon4} = 0$. Here, we simulated datasets with $\beta_{G \times \varepsilon4} = 0$. Table 2 presents estimates in a study with $n_0 = n_1 = 3000$, and S2 Table is based on studies with $n_0 = n_1 \in \{1000, 5000, 10000, 50000\}$. Estimates of $\beta_0$, $\beta_G$, $\beta_{Z_1}$, $\beta_{\varepsilon4}$, and $\beta_{G \times \varepsilon4}$ are clearly biased when the presence of the nuisance disease is ignored.
For example, in a study with $n_0 = n_1 = 3000$, pMLE-DX decreased the bias in $\hat\beta_{G \times \varepsilon4}$ from 0.12 in uLR and pMLE to 0.04, while RMSE remained approximately the same, 0.41 vs. 0.43. Similarly, pMLE-DX reduced the bias in $\hat\beta_{\varepsilon4}$ from -0.26 in uLR to 0.007. At the same time, the RMSE of $\hat\beta_{\varepsilon4}$ went from 0.28 (uLR, pMLE) to 0.12 (pMLE-DX). Increasing the sample size reduced the uLR bias for $\hat\beta_{G \times \varepsilon4}$, e.g. the bias is 0.09 in a study with $n_0 = n_1 = 10000$, but did not alleviate the substantial uLR bias in $\beta_{\varepsilon4}$. Across all sample sizes considered, the uLR estimates of $\beta_G$ are biased by approximately -0.12, while pMLE-DX reduced the bias to e.g. 0.01 in a study with 1,000 cases and 1,000 controls.

We next consider the effect of underestimating $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+)$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-)$ in the pseudo-likelihood. Here, we simulate data using the parameters specified above but, when fitting the pseudo-likelihood (S3 Table), set $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.3$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0$, i.e. each proportion is underestimated by 6 percentage points. Naturally, this misspecification introduced bias in some of the estimates and hence increased RMSE. Estimates of $\beta_{\varepsilon4}$ were generally affected more than the estimates of the other parameters. For example, in a study with 3,000 cases and 3,000 controls, the bias in $\hat\beta_{\varepsilon4}$ changed from 0.005 to -0.66 in pMLE-DX, while RMSE went from 0.12 to 0.67. In estimates of $\beta_{G \times \varepsilon4}$, the bias increased from 0.22 to 0.32, while RMSE went up from 0.93 to 0.94. The bias in $\hat\beta_G$ increased to -0.10 in a study with 3,000 cases and 3,000 controls, which has not reached the level of uLR, where the bias is -0.12. Estimates of $\beta_{X_2}$ remained nearly unbiased with the same RMSE.

We next consider the effect of overestimating $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+)$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-)$ in the pseudo-likelihood (S4 Table). Here, we simulate data using the parameters specified above but, when fitting the pseudo-likelihood, set $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.42$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0.12$, i.e. each proportion is overestimated by 6 percentage points. As expected, this misspecification inflated the bias in the risk estimates. For example, in a study of 3,000 cases and 3,000 controls, the bias in $\hat\beta_{\varepsilon4}$ changed from 0.005 to -0.43, while RMSE went from 0.12 to 0.44. The bias in $\hat\beta_{G \times \varepsilon4}$ decreased from 0.22 to 0.17, while RMSE remained the same. Estimates of $\beta_G$ and $\beta_{X_2}$ remained nearly unbiased.

Table caption (Setting A). The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0 = 3000$ controls and $n_1 = 3000$ cases. Frequency of the ApoE ε4 allele in the population is 14%. Variables $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population, 40% in the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele.
Setting B. We next examine a setting in which two sets of parameters define the risk of disease, i.e. for $D = 1$ vs. $D = 0$ and $D = 1^*$ vs. $D = 0$, according to risk model (2). Table 3 ($n_0 = n_1 = 3000$) and S5 Table present parameter estimates in the setting […]. With these parameters, the frequencies of the disease of interest and the nuisance disease are $\mathrm{pr}(D = 1)$ = 25.1%, $\mathrm{pr}(D = 1^*)$ = 12.5%, $\mathrm{pr}(D = 1 \mid \varepsilon4+)$ = 45.4%, $\mathrm{pr}(D = 1^* \mid \varepsilon4+)$ = 16.1%, $\mathrm{pr}(D = 1 \mid \varepsilon4-)$ = 20%, $\mathrm{pr}(D = 1^* \mid \varepsilon4-)$ = 16.1%.

When the presence of the nuisance disease is ignored (uLR, pMLE), estimates of $\beta_0$, $\beta_{\varepsilon4}$, $\beta_{G \times \varepsilon4}$, and $\beta_G$ are substantially biased. For example, in a study with 3,000 cases and 3,000 controls, the bias of uLR for $\hat\beta_{\varepsilon4}$ is -0.22, while pMLE-DX reduced this bias to -0.006; the bias of uLR for $\hat\beta_{G \times \varepsilon4}$ is -0.13, while pMLE-DX reduced this bias to 0.01; and the bias of uLR for $\hat\beta_G$ is 0.30, while pMLE-DX reduced it to 0.005. Biases in uLR persisted for larger sample sizes. If a priori evidence is sufficient to set parameters $\beta^*_{G \times \varepsilon4}$ and $\beta^*_G$ to 0, when in fact these coefficients are zero, then the RMSE of pMLE-DX is further reduced by at least 2-fold (data not shown).

Table 4 and S6 Table present the results in a setting similar to that of Table 3 but with no interaction between the genotype and ApoE ε4 status, i.e. $\beta_{G \times \varepsilon4} = 0$. Ignoring the nuisance disease in uLR resulted in a bias of -0.23 in the estimate of $\beta_{G \times \varepsilon4}$, which might lead to the mistaken conclusion that there is an interactive effect between the genotype and ApoE ε4 status. The bias persisted for larger sample sizes.

Table caption (Setting B). The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0 = 3000$ controls and $n_1 = 3000$ cases. Risk of the disease of interest is defined in a set of […]
Setting C. We next conducted a simulation study to better understand the nature of the biases in the estimates that arise when the presence of the nuisance disease is ignored (uLR). For clarity, we simulated all variables to be binary. Variables G, $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.10, 0.52 and 0.50, respectively. Risk coefficients […], and $\beta_{G \times \varepsilon4}$. The relationship between the clinical and pathophysiological diagnoses is set to be $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.36$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0.06$. We simulated 500 datasets with 3,000 cases and 3,000 controls.

[Figure: […] log(1), log(1.5), log(2), log(2.5), ..., log(8) across the x-axis; $\beta_{Z_2}$ is color-coded to be 0, 0.5, 1, 1.5. Panels A, B, C, D, and E show the biases […] $\hat\beta_{G \times \varepsilon4}$, respectively.]

With increasing value of $\beta_{\varepsilon4}$, the biases in the main effect estimates of $\beta_{Z_2}$, $\beta_{Z_1}$ and $\beta_G$ increase. For example, the bias in $\hat\beta_G$ reaches -0.10 when $\beta_{\varepsilon4}$ is log(5). The biases in $\hat\beta_{\varepsilon4}$ and $\hat\beta_{G \times \varepsilon4}$ are even more sensitive to the value of $\beta_{\varepsilon4}$. For example, when […]

[Figure: […] (8) across the x-axis; $\beta_{Z_2}$ is color-coded to be 0, 0.5, 1, 1.5. Panels A, B, C, D, and E show the […] $\hat\beta_{G \times \varepsilon4}$, respectively.]

In this setting, the biases in the main effects $\hat\beta_{Z_2}$, $\hat\beta_{Z_1}$ and $\hat\beta_G$ were approximately the same for all values of $\beta_{G \times \varepsilon4}$, while the biases in $\hat\beta_{\varepsilon4}$ and $\hat\beta_{G \times \varepsilon4}$ were more sensitive to the value of $\beta_{G \times \varepsilon4}$. For example, when the interaction coefficient is 0, the bias of $\hat\beta_{\varepsilon4}$ is nearly -2, while when $\beta_{G \times \varepsilon4} = \log(8) = 2.08$, the bias goes up to 3. When $\beta_{G \times \varepsilon4} = 0$, the bias in the estimate is nearly zero, while the bias goes to almost 6 when the true value is log(8).

Analyses of genetic variants serving toll-like receptors and receptor for advanced glycation end products in Alzheimer's disease
We applied the proposed analyses to a dataset collected as part of the Alzheimer's Disease Genetics Consortium. The data were anonymized prior to access by the authors. The data consist of 1,245 controls and 2,785 cases. The average ages (SD) of cases and controls are 72.1 (9.1) and 70.9 (8.8) years, respectively. Among cases, 1,458 (52.4%) are men; among controls, 678 (63.9%) are men. At least one ApoE ε4 allele is present in 64.5% of cases and in 365 (29.1%) of controls.
Illumina Human 660K markers have been mapped onto human chromosomes using the NCBI dbSNP database (https://www.ncbi.nlm.nih.gov/projects/SNP/). Chromosome location, proximal gene or genes, and gene structure location (e.g. intron, exon, intergenic, UTR) have been recorded for all SNPs. From these data, we inferred 111 SNPs to reside in genes serving Toll-Like Receptors (TLR). Similarly, we inferred 3 SNPs to reside in the Receptor for advanced glycation end products (AGER).
It is of interest to examine the relationship between the pathologic diagnosis and each of the 111 TLR SNPs (G), ApoE ε4 status (X), age ($Z_1$), and sex ($Z_2$). The effect of SNPs might vary by ApoE ε4 status; hence we included an interaction between the genotype and ApoE ε4 status. The genetic variables are modeled using a binary indicator of the presence or absence of a minor allele.
We estimate parameters using the standard logistic model (uLR), which uses the clinical diagnosis as a surrogate for the pathophysiologic diagnosis, and the pseudo-likelihood model (pMLE-DX), where we assume that the relationship between the clinical and pathophysiologic diagnoses is as estimated in the Salloway study [2], i.e. the proportion of the nuisance disease within the clinically diagnosed set is 36% in ApoE ε4 non-carriers and 6% in ApoE ε4 carriers. The pseudo-likelihood model pMLE-DX estimates the coefficients in a model that treats the nuisance disease and controls equivalently, as in Eq (1). pMLE-DX*, however, estimates two sets of risk coefficients, as in Eq (2). Data analyses are performed using MATLAB version R2017a. When optimizing the pseudo-likelihood function, we bounded the estimates to lie in the interval [-5, 5].
We first examine the results when statistical significance is assessed according to p-value < 0.05. We next correct for the false discovery rate using the Benjamini-Hochberg method [9].
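For reference, the Benjamini-Hochberg step-up procedure can be sketched as below. This is a generic textbook implementation in Python, not the code used for the analyses; the function name, the default level `q = 0.05`, and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level q.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= k * q / m, and reject the hypotheses with the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(pvals, q=0.05)
print(decisions)
```

Note that several p-values below 0.05 in this example are not rejected after correction, which is the behavior that distinguishes the two significance criteria described above.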
TLR. Table 5 shows estimates of the risk coefficients for 53 SNPs with permutation-based p-values for $\hat\beta_G$ or $\hat\beta_{G \times \varepsilon4}$ that are < 0.05 in any of the analyses. Of these 53 SNPs, 28 SNPs are within 500k up- or downstream of SNPs previously reported in GWAS of Alzheimer's disease, dementia, tauopathy, and/or vascular disease (S8 Table).
Estimates of $\beta_G$ or $\beta_{G \times \varepsilon4}$ differ numerically between the three approaches. One SNP, rs830832, has a significant $\hat\beta_{G \times \varepsilon4}$ both in uLR ($\hat\beta_{G \times \varepsilon4} = 0.74$, $p = 0.01$) and in pMLE-DX* ($\hat\beta_{G \times \varepsilon4} = 2.6$, $p = 0.03$). This SNP is located in the intergenic region between SORBS2 and TLR3 on Chromosome 4 and is 72k downstream of SNP rs75718659, which was reported to be associated with Alzheimer's disease in a family-based GWAS [10].
Among the seven SNPs that appear to have a significant $\hat\beta_{G \times \varepsilon4}$ in pMLE-DX* but not in uLR, two SNPs, rs4862611 ($\hat\beta_{G \times \varepsilon4} = -2.8$, $p = 0.03$) and rs1706143 ($\hat\beta_{G \times \varepsilon4} = 2.9$, $p = 0.03$), are also located in the intergenic region between SORBS2 and TLR3 on Chromosome 4 and are 80k and 20k downstream of SNP rs75718659, respectively.
Estimates of $\beta_G$, however, are generally larger in magnitude when estimated in the pMLE-DX and pMLE-DX* models.
Estimates of $\beta_{\varepsilon4}$ in the absence of interaction are generally larger in magnitude for the diagnosis of interest in pMLE-DX. For example, in a model with SNP rs1816702, uLR gives $\hat\beta_{\varepsilon4} = 1.4$ ($p = 0.00$), while pMLE-DX* gives $\hat\beta_{\varepsilon4} = 2.6$ ($p = 0.03$) and $\hat\beta^*_{\varepsilon4} = 1.2$ ($p = 0.01$).

AGER. All three SNPs in the AGER gene measured in the data are associated with susceptibility to AD as inferred in uLR, and are also associated with susceptibility to the nuisance disease when measured by pMLE-DX*. rs3134940 has been previously reported in association with breast cancer, type I diabetes and other phenotypes (https://www.gwascentral.org/marker/HGVM1600838/results?t=ZERO); rs1035798 and rs2070600 have been previously reported in association with rheumatoid arthritis (https://www.gwascentral.org/marker/HGVM275161/results?t=ZERO and https://www.gwascentral.org/marker/HGVM571318/results?t=ZERO).

Discussion
We investigated whether disease heterogeneity among clinically diagnosed cases could introduce bias into the estimates of GxE interactions. We showed that when there is a strong association between the environmental variable and the relative risk of the disease of interest, as compared to the nuisance disease, there could be bias in either direction. We base our developments on the method of Chatterjee and Carroll [7], which is fully efficient in situations when the genetic and environmental variables are distributed independently in the population, a population-based genetics model is assumed for the genetic factors, and the environmental variables are treated non-parametrically.
Interestingly, in our analyses, the estimates of regression coefficients differed qualitatively between the analyses that used the clinical diagnosis as a surrogate for the pathologic diagnosis and the analyses that used our newly proposed pseudo-likelihood approach, which incorporates the uncertainty of the clinical diagnosis. Specifically, in the TLR set, for 13% of the SNPs examined, GxE was found to be significant in relation to the clinical diagnosis, while the pseudo-likelihood analyses inferred these GxE to be not significant. On the other hand, for 14% of the SNPs that we examined, GxE was found to be statistically significant only when we incorporated the uncertainty in the relationship between the clinical and pathological diagnoses. This finding is consistent with the conclusion reached by a study of phenotypic misclassification among cases [20] in situations when the misclassification is non-differential, i.e. is not a function of the environmental variables. That study concluded that the presence of "non-cases" greatly decreased the estimates of risk attributed to genetic variation.
One of the major concerns in the analysis of genetic studies has been the missing heritability, whereby the genetic markers identified thus far explain only a small portion of the inter-person variability in familial clustering of complex diseases [21]. The downward biases in the estimates associating GxE with the clinically diagnosed disease status might in part explain the missing heritability. On the other hand, the upward biases in these estimates might in part address the conclusion reached by [22] that only 1% of the associations found are likely to be true.
We examined estimates of the effects of the genetic variants, ApoE ε4 status, age, and sex, consistent with the original publication on this dataset [23]. Epidemiologic evidence [24] suggests that the following factors play an important role in AD risk: education/cognitive reserve, racial and ethnic differences, gender, smoking, drinking, head injury, diabetes, cardiovascular disease, obesity, social engagement, etc. However, not all of these factors have been consistently confirmed by subsequent studies, and considerable inconsistencies exist. For example, nicotine intake has been observed to decrease the risk of dementia, attributed to the demonstrated ability of nicotine to stimulate neurotransmitter systems that are compromised in dementia [25]. More recent studies have suggested that nicotine intake may increase the risk of AD and also bring forward the age of onset, with an interactive effect of APOE [26].
The main conclusion reached in this paper is that using the clinically diagnosed status can lead to severely biased estimates of GxE interactions when the frequency of the pathologic diagnosis of interest, as compared to other diagnoses, depends on the environment; we aim to correct such biases by proposing a pseudo-likelihood method. The AD dataset is mainly used for illustration; therefore, for clarity, we restricted the variables to the minimum necessary instead of considering full risk prediction models, which might better describe the inter-patient variability in susceptibility to AD. Although other factors are potentially important in predicting the risk of AD, this relatively simple model was able to achieve the main goals of the current manuscript. By recognizing and accounting for the potential for case heterogeneity, which biases the gene x environment interaction, our newly proposed method is able to remove this bias.
Define […] [27]. If, however, O interacts with GxE, then the addition of these variables would change the effect estimate of GxE in the direction that is consistent with the direction of the GxE effect. Further studies that incorporate environmental variables, such as medical history, tobacco use, and infections, are needed to assess their potential to modify the risk, and the estimates of GxE in particular.
Epigenetic mechanisms are well recognized in the mediation of GxE, and analysis of epigenetic changes at the genome scale can offer new insights into the relationship between brain epigenomes and AD. Further, candidate genes from epigenome-wide association studies interact with those from GWAS that can undergo epigenetic changes in their upstream gene regulatory elements [28]. However, an active conundrum is how the epigenetic mechanisms influence gene-environment interactions. We note that […]

Supporting information

S1 Table. $\beta_{G \times \varepsilon4} \neq 0$. The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0$ controls and $n_1$ cases. Frequency of the ApoE ε4 allele in the population is 14%. Variables $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population, 40% in the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele.

S2 Table. $\beta_{G \times \varepsilon4} = 0$. The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0$ controls and $n_1$ cases. Frequency of the ApoE ε4 allele in the population is 14%. Variables $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population, 40% in the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele.

S3 Table. Frequency of the nuisance disease is underestimated. The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0 = 3000$ controls and $n_1 = 3000$ cases. Frequency of the ApoE ε4 allele in the population is 14%. Variables $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population, 40% in the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele.

S4 Table. Frequency of the nuisance disease is overestimated. Bias and Root Mean Squared Error (RMSE) for parameter estimates based on a study of 500 simulated datasets with $n_0$ controls and $n_1$ cases with clinical phenotype. Analyses are based on the usual logistic regression model, which ignores the nuisance disease, and on the pseudo-likelihood with (pMLE-DX) and without (pMLE) consideration of the relationship between the clinical and pathological diagnoses. Frequency of the ApoE ε4 allele is 14% in the population. Variables $Z_1$ and $Z_2$ are Bernoulli with frequencies 0.50 and 0.52, respectively. Frequency of the true disease status is 46% in the population, 40% in the subpopulation with no ApoE ε4 alleles, and 82% in the subpopulation with at least one ApoE ε4 allele. Frequency of the nuisance disease within the clinical diagnosis varies by ApoE ε4 status: $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.36$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0.06$. The relationship between the clinical and pathological diagnoses is misspecified to be $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4-) = 0.42$ and $\mathrm{pr}(D = 1^* \mid D^{CL} = 1, \varepsilon4+) = 0.12$. (DOCX)

S5 Table. $\beta^*_{G \times \varepsilon4} = 0$ and $\beta^*_G = 0$. The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0$ controls and $n_1$ cases. Risk of the disease of interest is defined in a set of parameters $\beta_0$, $\beta_G$, $\beta_{Z_1}$, $\beta_{Z_2}$, $\beta_{G \times \varepsilon4}$, while the risk of the nuisance disease is parametrized by […]

S6 Table. $\beta_{G \times \varepsilon4} = 0$, $\beta^*_{G \times \varepsilon4} = 0$, $\beta^*_G = 0$. The Bias and Root Mean Squared Error (RMSE) in parameter estimates from simulations using the usual logistic regression with clinical diagnosis as the outcome (uLR), the pseudo-likelihood approach (pMLE), and our newly proposed pseudo-likelihood approach that accounts for misdiagnosis (pMLE-DX). For these simulations, the study included $n_0$ controls and $n_1$ cases. Risk of the disease of interest is defined in a set of parameters $\beta_0$, $\beta_G$, $\beta_{Z_1}$, $\beta_{Z_2}$, $\beta_{G \times \varepsilon4}$, while the risk of the nuisance disease is parametrized by […]

S7 Table. Parameter estimates in Alzheimer's disease study. Analyses are performed using the usual logistic regression (uLR), which uses the clinical diagnosis as the outcome, and using the pseudo-likelihood method, which assumes that the proportion of nuisance disease within the clinically diagnosed AD is 36% for ε4 non-carriers and 6% for ε4 carriers. Pseudo-likelihood analysis pMLE-DX estimates parameters for $D = 1$ vs. $D = 0$ and $D = 1^*$ combined. Pseudo-likelihood analysis pMLE-DX*, however, estimates two sets of risk coefficients, i.e. $\beta$s for $D = 0$ vs. $D = 1$ and $\beta^*$s for $D = 0$ vs. $D = 1^*$. (DOCX)

S8 Table. SNPs previously reported in GWAS that are within 500k up- or downstream of SNPs that we inferred in the Alzheimer's disease study (Table 5, SNPs whose effect estimates of $\beta_G$ and/or $\beta_{G \times \varepsilon4}$ have a permutation-based p-value < 0.05). (DOCX)