Epistatic Module Detection for Case-Control Studies: A Bayesian Model with a Gibbs Sampling Strategy

The detection of epistatic interactive effects of multiple genetic variants on the susceptibility of human complex diseases is a great challenge in genome-wide association studies (GWAS). Although methods have been proposed to identify such interactions, the lack of an explicit definition of epistatic effects, together with computational difficulties, makes the development of new methods indispensable. In this paper, we introduce epistatic modules to describe epistatic interactive effects of multiple loci on diseases. On the basis of this notion, we put forward a Bayesian marker partition model to explain observed case-control data, and we develop a Gibbs sampling strategy to facilitate the detection of epistatic modules. Comparisons of the proposed approach with three existing methods on seven simulated disease models demonstrate the superior performance of our approach. When applied to a genome-wide case-control data set for Age-related Macular Degeneration (AMD), the proposed approach successfully identifies two known susceptible loci and suggests that a combination of two other loci—one in the gene SGCD and the other in SCAPER—is associated with the disease. Further functional analysis supports the speculation that the interaction of these two genetic variants may be responsible for the susceptibility of AMD. When applied to a genome-wide case-control data set for Parkinson's disease, the proposed method identifies seven suspicious loci that may contribute independently to the disease.

is a function of the genotype of the loci that belongs to 1 S but not 2 S . The penetrance of 1 S is the product of a function of the genotype of a sub set of 1 S ( 2 S ) and a function of the genotype of the complement of this sub set ( 12  S ). This is inconsistent with the assumption that 1 S is an epistatic module.
On the other hand, if 21   SS  -2 -Now we have 1 1 2 2 1 ( 2 1 2) 1 ( 2 1 2) is a function of the genotype of the loci belonging to 1 S but not 2 S , and 1 2 2 2 1 2 2 1 2 1 2 2 is a function of the genotype of the loci belonging to both 1 S and 2 S . The penetrance of 1 S is the product of a function of the genotype of a sub set of 1 S ( 12 S  ) and a function of the genotype of the complement of this sub set ( 1 1 2  S  ). This is inconsistent with the assumption that 1 S is an epistatic module. Because 12 0  S  is inconsistent with the truth that either 1 S is an epistatic module, we have that 12 0  S  .
, and consequently the epistatic modules are independent in the case population. In control population, For complex disease, the disease prevalence is usually very small, we have

Determination of the penetrance of the combinatory genotypes in simulation studies
For each model, there is a relative risk table in which only one parameter f is to be determined (see table 1 in the main text for details). Considering that real penetrance (the risk to be affected) is the product of the relative risk and another parameter that represents the value that "1" in the relative risk table stands for, there are only two parameters to be determined. In order to control the "marginal effect size"  of the first disease locus (locus A in table 1) and the population prevalence () pD, there will be two equations to solve. The definition of  is the same as the one used in (Zhang and Liu 2007). We use b as the parameter that is equal to the value that "1" in the relative risk

RR
Given the marginal effect size  and the population prevalence () pD, the solution of two model parameters (b and f from R) can easily be obtained.
In simulation studies, we did not include the genotypes of the causative loci in the genotype-phenotype data. Instead, the genotypes of markers that are in LD with the causative loci were recorded in the data. We code the minor allele as 0 and the common allele as 1 here. Given (1 ) Since there are two equations for two variables, 0,0 Q and 1,0 Q , we can easily solve the equations to obtain 0,0 Q and 1,0 Q .
Assuming Hardy-Weinberg equilibrium, the transition probability matrix for marker genotypes given causative locus genotype is as follows:

Number of irrelevant SNPs
If we test all possible combinations of n SNPs in the observed data and adjust the p-value with an exact correction method (e.g., Bonferroni correction), the probability of type I error could be controlled at the predefined level (0.05 in the paper), under the condition that the test statistic (e.g., the Chi-squared statistic) has good asymptotic characteristics. Consequently, the type I errors of all the four methods compared in main text could be expected to be controlled at the predefined level (0.05).
To validate this, we provide Figure S1 that gives the result of the average numbers of irrelevant SNPs identified by the four methods in our simulation studies. In the figure, the average number of irrelevant SNPs for a parameter setting (sub-model) is calculated by counting the number of unassociated SNPs being identified at the significance level 0.05 after Bonferroni correction and dividing this number by the total number of simulated datasets for the parameter setting (i.e., 100). The results for epiMODE are obtained using the selection-testing-correction method. The cutoff value of the posterior probability for selecting epistatic modules for further statistical test is 0.20. Note that, this cutoff value may influence the number of modules that goes to the statistical test procedure, but will not influence the inclusion of irrelevant SNPs in identified epistatic modules much because our Bonferroni correction is based on the total number of SNPs instead of the number of selected SNPs. We can see that in most cases, the average number of irrelevant SNPs for epiMODE, BEAM and single-locus Chi-squared test are all controlled at about 0.05, while stepwise logistic regression generally identified more irrelevant SNPs.

Impact of Dirichlet hyper-parameters
In order to investigate the influence of Dirichlet hyper-parameters in the proposed method, we have also tried 0.3, 0.7, and 1.0 besides 0.5 that is used in the main text. The results are shown in Figure S2. Basically, as we can see from the results, the values of Dirichlet hyper-parameters do