The authors have declared that no competing interests exist.

Conceived and designed the experiments: YY ID SM HY. Performed the experiments: YY QL ID SM. Analyzed the data: YY QL ID SM HY NSE EK XW SJ. Contributed reagents/materials/analysis tools: YY QL ID SM HY NSE EK XW. Wrote the paper: YY QL ID SM HY NSE EK XW KT.

In genome-wide association studies (GWAS), the association between each single nucleotide polymorphism (SNP) and a phenotype is assessed statistically. To further explore genetic associations in GWAS, we considered two specific forms of biologically plausible SNP-SNP interactions, ‘SNP intersection’ and ‘SNP union,’ and analyzed the Crohn's Disease (CD) GWAS data of the Wellcome Trust Case Control Consortium for these interactions using a limited form of logic regression. We found strong evidence of CD-association for 195 genes, identifying novel susceptibility genes (e.g.,

Analysis of genome-wide association studies (GWAS) often focuses on identifying individual single nucleotide polymorphisms (SNPs) that modify the risk of a phenotype, treating the association of each SNP in isolation without considering the involvement of any other SNPs. GWASs of Crohn's Disease (CD) have also focused largely on finding such ^{1}. The other SNPs that are statistically significantly associated with CD risk, however, show very weak associations, with estimated odds ratios typically below 1.5. In addition, the sum of such marginal associations falls far short of accounting for the estimated genetic contribution to the risk of CD

To incorporate these specific forms of SNP-SNP interactions in GWAS data analysis, we propose using logic regression to search for sets of SNPs that are jointly associated with the phenotype of interest in the form of a single SNP intersection or union, or in combinations thereof

Our logic-regression-based gene-level SNP-SNP-interaction analysis of GWAS data can be summarized as follows. Combinations of SNP intersections and unions can be expressed mathematically as Boolean combinations, such as (X1 ∧ X2) ∨ X3^{c}, where “∧”, “∨”, and “^{c}” represent intersection (AND), union (OR), and complement (NOT), respectively, and the X's are indicators of SNP genotypes. The logic regression model takes the form: logit(P(Y = 1)) = β_{0} + β_{1}L_{1} + β_{2}L_{2} + … + β_{p}L_{p}, where Y is the case-control indicator, β_{0}, β_{1}, …, β_{p} are the parameters, and L_{1}, L_{2}, …, L_{p} are Boolean combinations of genotype indicators of SNPs within a gene, also called logic trees. The logic trees are selected adaptively, using a Simulated Annealing algorithm, and based on deviance as the model fit measure ^{2}≥0.8). In each logic regression fit, we allowed a maximum of two Boolean combinations (Ls) of at most five indicators of SNP genotypes in total. Note that these constraints are necessary in GWAS because logic regression must search a large number of potential combinations and therefore comes with a high computational cost. To correct for the inherent instability of the performance measure when searching a large space, we refit the logic regression 20 times, starting the algorithm with 20 different initial values. This process was applied to the original dataset as well as to 20 datasets obtained by permutations of the case-control labels. Of the 20 results produced by the 20 starting values, we selected the best fit, measured by deviance.
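As an illustration of how a logic tree maps genotype indicators to a single binary predictor, the Boolean combination (X1 ∧ X2) ∨ X3^{c} used as the example above can be evaluated per subject as in the following minimal sketch. The function name and the toy indicator values are hypothetical and are not taken from the study data or from any logic regression software.

```python
def logic_tree(x1, x2, x3):
    """Evaluate L = (X1 AND X2) OR (NOT X3) for one subject.

    Each argument is a boolean genotype indicator (hypothetical example).
    """
    return (x1 and x2) or (not x3)


# Toy genotype indicators for four hypothetical subjects.
subjects = [
    (True, True, True),
    (True, False, True),
    (False, True, True),
    (False, False, False),
]

# One binary predictor value per subject; this is what enters the
# regression model as L_1.
L1 = [logic_tree(x1, x2, x3) for (x1, x2, x3) in subjects]
print([int(v) for v in L1])  # [1, 0, 0, 1]
```

The resulting 0/1 vector plays the role of a single L term in the model; the adaptive search then compares candidate trees of this kind by deviance.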

Running logic regression for each gene in the original dataset, as well as in the 20 case-control-label permuted datasets, yields an approximate Bayes Factor (BF) for each gene. The BF is approximated by the corresponding Likelihood Ratio in this case (which eliminates the need to specify priors, similar to the approximation used by the Bayesian Information Criterion for BF), expressed in the base-10 logarithm (equivalent to a LOD Score), where the denominator is the median of 19 (log10) maximum likelihoods from the 19 permuted datasets (20 minus one because the BF of a permuted dataset should not use its own value in calculating the median over the permuted datasets). An important feature of this approximate BF is that the denominator standardizes for the higher potential of genes with larger numbers of SNPs to overfit. We follow the Wellcome Trust Case Control Consortium (WTCCC)'s framework of using BF as the measure of evidence of the observed association between each gene and CD risk^{12}. Specifically, suppose we have N genes to be investigated, of which 10 genes are assumed to be truly associated with CD risk. The prior odds of CD-risk association for any gene are therefore 10/(N-10). For the posterior odds of CD-risk association for a gene to reach 10 (i.e., the probability that the gene is associated with CD risk is 10/11, or approximately 0.91), the likelihood ratio for association over no association (i.e., the BF under the same-size logic-regression model) has to be (N-10). Based on the number of genes we examined in the WTCCC dataset (13,106 after mapping), the WTCCC framework above specifies a log10 BF of 4.12 (the base-10 logarithm of 13,096) as the threshold, above which there is strong evidence of association between the gene and CD risk. The P-value for each gene is calculated as the proportion of all permuted BF values of
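The threshold arithmetic and the permutation-standardized BF described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the log-likelihood inputs are made-up numbers, and the set of permuted fits that enters the median is simply whatever list is passed in.

```python
import math
from statistics import median


def log10_bf_threshold(n_genes, n_true=10, posterior_odds=10.0):
    """log10 BF needed to move prior odds n_true/(n_genes - n_true)
    up to the requested posterior odds."""
    prior_odds = n_true / (n_genes - n_true)
    return math.log10(posterior_odds / prior_odds)


def approx_log10_bf(observed_log10_lik, permuted_log10_liks):
    """Approximate log10 BF: observed log10 maximum likelihood minus the
    median log10 maximum likelihood over the permuted datasets."""
    return observed_log10_lik - median(permuted_log10_liks)


# 13,106 genes after mapping, 10 assumed truly associated.
print(round(log10_bf_threshold(13106), 2))  # 4.12

# Hypothetical log10 likelihoods for one gene (not study data).
print(approx_log10_bf(-1000.0, [-1010.0, -1012.0, -1008.0]))  # 10.0
```

Because the same search is run on the permuted data, subtracting the permuted median removes the part of the likelihood gain that a gene with many SNPs would obtain by chance alone.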

We checked whether our BF-based hypothesis testing has a proper size (i.e., control over the false positive rate) using a simulation study. We randomly chose 200 genes from Chromosome 1 (which contains approximately 1,300 genes after mapping). We simulated a total of 50 null-hypothesis datasets by shuffling case-control labels randomly and imposing an equal number of cases and controls in each dataset. We ran the logic-regression-based SNP-SNP interaction analysis and estimated a p-value for each of the 200 genes in each of the 50 null datasets. The resulting 10,000 p-values roughly followed a uniform distribution (data not shown), indicating that our testing procedure has a proper size and proper control over the false positive rate.
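One simple way to inspect the size of a test from null p-values is to compare the fraction falling at or below a nominal level with that level. The sketch below is only an illustration of this idea: the text does not state how uniformity was assessed, the function name is hypothetical, and the p-values are an evenly spaced stand-in rather than the simulated results.

```python
def empirical_size(p_values, alpha):
    """Fraction of null p-values at or below alpha.

    For a well-sized test this fraction should be close to alpha.
    """
    return sum(p <= alpha for p in p_values) / len(p_values)


# Hypothetical stand-in for 10,000 null p-values (200 genes x 50 datasets):
# an evenly spaced grid on (0, 1), i.e. an idealized uniform sample.
null_p = [(i + 0.5) / 10000 for i in range(10000)]

for alpha in (0.01, 0.05, 0.10):
    print(alpha, empirical_size(null_p, alpha))
```

With real simulation output, marked departures of the empirical fraction above the nominal level would indicate an inflated false positive rate.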

We applied the logic-regression-based SNP-SNP interaction analysis method to the WTCCC's GWAS data comparing 2,005 CD cases to the 1,502 members of the 1958 British Birth Cohort (58C) plus the 1,500 controls of the UK Blood Service sample; these used Affymetrix GeneChip Human Mapping 500K Array Sets

The genotype calls of the WTCCC were generated by the Chiamo calling algorithm. Following the WTCCC's recommendations, we only considered genotype calls with confidence score >0.9, and treated the rest of the calls as missing genotypes. SNPs with SNP call rates less than 95% were removed. We also removed SNPs based on their minor allele frequencies: the default minor allele frequency cutoff in the GenABEL R package was used (2.5/N where N is the number of subjects), resulting in cutoffs of 0.05% for the WTCCC database and 0.3% for the Jewish and non-Jewish dbGaP databases. We used a cutoff of 0.2 for the Hardy-Weinberg Equilibrium (HWE) test's false discovery rates, based on controls. SNP-gene mapping files were retrieved from the OpenBioinformatics website (
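The minor allele frequency cutoffs quoted above follow directly from the 2.5/N rule. The sketch below reproduces the arithmetic, assuming that N is the post-filtering number of cases plus controls reported for each dataset in the replication table (1748/2936, 498/498, and 291/429); that choice of N and the function name are assumptions for illustration.

```python
def maf_cutoff(n_subjects):
    """Minor allele frequency cutoff of 2.5/N, as a proportion."""
    return 2.5 / n_subjects


# Assumed sample sizes (cases + controls) per dataset.
datasets = {
    "WTCCC": 1748 + 2936,
    "Non-Jewish dbGaP": 498 + 498,
    "Jewish dbGaP": 291 + 429,
}

for name, n in datasets.items():
    # Print the cutoff as a percentage with two decimals.
    print(f"{name}: N = {n}, cutoff = {100 * maf_cutoff(n):.2f}%")
```

Under these assumed sample sizes the WTCCC cutoff rounds to 0.05%, and the two dbGaP cutoffs fall on either side of the quoted 0.3%.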

We checked the homogeneity of the three populations, WTCCC, dbGaP non-Jewish and Jewish, by running Principal Component Analysis using the R package GenABEL

There were 195 genes with strong evidence of association between the gene and CD risk in the logic-regression gene-level SNP-SNP-interaction analysis of the WTCCC GWAS data, 40 of which are listed in ^{12}, as well as seven out of the eight regions showing moderate evidence of association, were represented among the 195 genes. Thirty-seven (63%) of the 59 chromosomal locations that were previously identified by a meta-analysis of single-SNP studies involving over 22,000 cases and 29,000 controls

| Gene Name | Chromosome | #SNPs | p-value | C.BF |
|---|---|---|---|---|
| ISX | 22q12 | 84 | <3.8×10^{−6} | 148.5 |
| SEMA6A | 5q23 | 152 | <3.8×10^{−6} | 96.2 |
| GTF3C4 | 9q34 | 4 | <3.8×10^{−6} | 91.8 |
| PTGFRN | 1p13 | 15 | <3.8×10^{−6} | 85.5 |
| ADRA1B | 5q33 | 45 | <3.8×10^{−6} | 82.3 |
| MYLK3 | 16q11 | 2 | <3.8×10^{−6} | 77.0 |
| HTR3B | 11q23 | 10 | <3.8×10^{−6} | 75.7 |
| RRP15 | 1q41 | 29 | <3.8×10^{−6} | 75.4 |
| RGL1 | 1q25 | 20 | <3.8×10^{−6} | 69.9 |
| SORBS1 | 10q23 | 46 | <3.8×10^{−6} | 65.5 |
| CALCOCO1 | 12q13 | 15 | <3.8×10^{−6} | 57.9 |
| TMEM156 | 4p14 | 13 | <3.8×10^{−6} | 52.7 |
| XRCC6BP1 | 12q14 | 38 | <3.8×10^{−6} | 45.9 |
| FXR1 | 3q28 | 7 | <3.8×10^{−6} | 37.7 |
| GARNL1 | 14q13 | 4 | <3.8×10^{−6} | 34.9 |
| GPR161 | 1q24 | 7 | <3.8×10^{−6} | 30.9 |
| SORCS1 | 10q23-q25 | 265 | <3.8×10^{−6} | 30.6 |
| SAC | 1q24 | 13 | <3.8×10^{−6} | 28.4 |
| LRP1B | 2q21 | 241 | <3.8×10^{−6} | 27.2 |
| C18orf62 | 18q23 | 79 | <3.8×10^{−6} | 25.9 |
| CSRP1 | 1q32 | 17 | <3.8×10^{−6} | 24.2 |
| POU6F2 | 7p14 | 58 | <3.8×10^{−6} | 22.6 |
| LEF1 | 4q23-q25 | 31 | <3.8×10^{−6} | 22.3 |
| SEL1L | 14q31 | 170 | <3.8×10^{−6} | 21.9 |
| SVIP | 11p14 | 88 | <3.8×10^{−6} | 21.7 |
| VRK1 | 14q32 | 128 | <3.8×10^{−6} | 19.3 |
| GLRX3 | 10q26 | 79 | <3.8×10^{−6} | 18.4 |
| ID4 | 6p22 | 79 | <3.8×10^{−6} | 15.3 |
| CDH10 | 5p14 | 107 | <3.8×10^{−6} | 14.9 |
| NOD2 | 16q21 | 5 | <3.8×10^{−6} | 14.6 |
| NHLRC1 | 6p22 | 7 | <3.8×10^{−6} | 14.0 |
| FMN2 | 1q43 | 60 | <3.8×10^{−6} | 14.0 |
| IL23R | 1p31 | 11 | <3.8×10^{−6} | 13.6 |
| PTGER4 | 5p13 | 46 | <3.8×10^{−6} | 13.5 |
| CTNNA3 | 10q22 | 257 | <3.8×10^{−6} | 13.3 |
| PNPLA6 | 19p13 | 5 | <3.8×10^{−6} | 13.0 |
| FBXO15 | 18q22 | 94 | <3.8×10^{−6} | 12.5 |
| ATG16L1 | 2q37 | 7 | <3.8×10^{−6} | 12.4 |
| RTP2 | 3q27 | 4 | <3.8×10^{−6} | 12.0 |
| KCNIP4 | 4p15 | 154 | <3.8×10^{−6} | 11.8 |

indicates genes in the chromosomal locations where the WTCCC single-SNP analysis showed

indicates genes in the chromosomal locations where the WTCCC single-SNP analysis showed

indicates chromosomal locations are those with three or more genes in the 195 genes (see

Intestine Specific Homeobox (

| | rs11089728 CC | rs9610191 TT | rs17778240 TT | rs17778240 TT | rs5999715 AC |
|---|---|---|---|---|---|
| Genotype Freq, Case (N = 1748) | 797 (45.6%) | 10 (0.6%) | 466 (26.7%) | 466 (26.7%) | 214 (12.2%) |
| Genotype Freq, Cont (N = 2936) | 1326 (45.2%) | 18 (0.6%) | 776 (26.4%) | 776 (26.4%) | 17 (0.6%) |

Logic 1 | AND (OR) |

Logic 2 | AND |

| Logic-based Risk Groups | Frequency, Case | Frequency, Cont | Odds Ratio |
|---|---|---|---|
| Logic 1 = No, Logic 2 = No | 1540 | 2562 | 1.0 |
| Logic 1 = Yes, Logic 2 = No | 1 | 372 | 0.0045 |
| Logic 1 = No, Logic 2 = Yes | 2 | 0 | 172.2 (both Logic 2 = Yes rows combined) |
| Logic 1 = Yes, Logic 2 = Yes | 205 | 2 | |
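The odds ratios for the logic-based risk groups can be checked against the case and control counts, taking the group with Logic 1 = No and Logic 2 = No as the reference. The sketch below assumes that the value 172.2 refers to the two Logic 2 = Yes groups pooled together (an odds ratio cannot be computed for the group with zero controls on its own); the function name is hypothetical.

```python
def odds_ratio(case_group, control_group, case_ref, control_ref):
    """Odds ratio of a risk group relative to the reference group."""
    return (case_group * control_ref) / (case_ref * control_group)


# Reference group: Logic 1 = No, Logic 2 = No.
CASE_REF, CONTROL_REF = 1540, 2562

# Logic 1 = Yes, Logic 2 = No: 1 case, 372 controls.
print(round(odds_ratio(1, 372, CASE_REF, CONTROL_REF), 4))  # 0.0045

# Logic 2 = Yes, pooling both Logic 1 levels (assumption):
# 2 + 205 cases, 0 + 2 controls.
print(round(odds_ratio(2 + 205, 0 + 2, CASE_REF, CONTROL_REF), 1))  # 172.2
```

Both reported values are recovered, which is consistent with the pooled reading of the Logic 2 = Yes rows.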

We note that using the WTCCC dataset for discovery and the dbGaP non-Jewish and Jewish datasets for replication is untenable, because of the observed population differences (

| GWAS Study Name | | WTCCC | | | Non-Jewish dbGaP | | | Jewish dbGaP | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Sample Size (Cases/Controls) | | (1748/2936) | | | (498/498) | | | (291/429) | | |
| Gene Name | Chromosome | #SNPs | p-value | C.BF | #SNPs | p-value | C.BF | #SNPs | p-value | C.BF |
| IL23R | 1p31 | 11 | <3.8×10^{−6} | 13.6 | 14 | <3.8×10^{−6} | 9.3 | 18 | 1.4×10^{−3} | 3.8 |
| NOD2 | 16q21 | 5 | <3.8×10^{−6} | 14.6 | 4 | 2.9×10^{−5} | 5.8 | 4 | 8.7×10^{−3} | 2.8 |
| TMEM183A | 1q32 | 10 | 7.6×10^{−5} | 5.3 | 5 | 6.0×10^{−4} | 4.3 | 5 | 1.9×10^{−3} | 3.7 |
| SLCO6A1 | 5q21 | 15 | 5.7×10^{−5} | 5.4 | 11 | 4.7×10^{−4} | 4.4 | 11 | 1.6×10^{−2} | 2.4 |
| PTGER4 | 5p13 | 46 | <3.8×10^{−6} | 13.5 | 50 | 1.6×10^{−4} | 4.9 | 53 | 1.6×10^{−1} | 0.9 |
| CYLD | 16q12 | 30 | <3.8×10^{−6} | 11.2 | 22 | 2.2×10^{−3} | 3.6 | 21 | 1.1×10^{−1} | 1.2 |
| SOCS6 | 18q22 | 111 | 3.4×10^{−5} | 5.6 | 147 | 4.7×10^{−2} | 1.8 | 145 | 1.5×10^{−2} | 2.5 |
| ACAD11 | 3q22 | 4 | <3.8×10^{−6} | 6.2 | 5 | 3.7×10^{−1} | 0.3 | 5 | 1.2×10^{−3} | 3.9 |
| CLSTN2 | 3q23 | 120 | <3.8×10^{−6} | 6.3 | 100 | 4.0×10^{−4} | 4.4 | 104 | 7.6×10^{−1} | −0.5 |
| SOX11 | 2p25 | 194 | <3.8×10^{−6} | 9.6 | 194 | 2.8×10^{−3} | 3.4 | 188 | 3.0×10^{−1} | 0.4 |
| CEBPB | 20q13 | 15 | 5.9×10^{−4} | 4.2 | 27 | 2.2×10^{−1} | 0.7 | 28 | 5.5×10^{−3} | 3.1 |
| C1orf141 | 1p31 | 10 | <3.8×10^{−6} | 10.3 | 6 | 4.1×10^{−3} | 3.2 | 7 | 3.5×10^{−1} | 0.3 |
| NEK2 | 1q32-q41 | 11 | 8.8×10^{−5} | 5.1 | 13 | 4.0×10^{−1} | 0.2 | 12 | 4.9×10^{−3} | 3.1 |
| NKX2-3 | 10q24 | 14 | 7.6×10^{−6} | 6.0 | 7 | 8.4×10^{−3} | 2.8 | 7 | 2.9×10^{−1} | 0.5 |
| BSN | 3p21 | 4 | <3.8×10^{−6} | 7.0 | 3 | 1.2×10^{−2} | 2.6 | 3 | 7.0×10^{−1} | −0.4 |
| RBMS3 | 3p24-p23 | 157 | 2.1×10^{−4} | 4.7 | 177 | 8.8×10^{−1} | −0.8 | 175 | 6.9×10^{−3} | 2.9 |
| C10orf57 | 10q22 | 6 | 6.3×10^{−4} | 4.2 | 9 | 8.1×10^{−1} | −0.6 | 9 | 1.4×10^{−2} | 2.5 |

indicates genes in the chromosomal locations where the WTCCC single-SNP analysis showed

indicates chromosomal locations are those with three or more genes in the 195 genes (see

Our results illustrate the power of the logic-regression-based GWAS analysis in identifying specific forms of SNP-SNP interactions associated with a phenotype and in explaining a greater extent of CD genetics. We found strong evidence of CD association for 195 genes, including both loci previously identified through single-SNP analysis and newly identified susceptibility genes.

In this paper, we reduced the computational demand of logic regression by limiting the search to SNP combinations within the same gene, and also by fixing the size of SNP combinations in the search. These strategies have a definite disadvantage: the search will not be comprehensive, and true underlying SNP-SNP interactions that are more complex than the limited size under consideration will not be discovered. In view of the current practice of assessing the marginal effects of individual SNPs one at a time, however, we submit that the limited form of logic regression proposed here provides a clear advance over, and an alternative to, the individual-SNP analysis. It can search for more biologically plausible forms of SNP effects (combinations of SNP intersections and/or unions) with greater degrees of association, indicated by appreciably larger odds ratios, although the search remains approximate due to the limited size.

Despite the limitation of our approach by the small fixed size of logic regression models, the successful discovery of CD susceptibility genes demonstrates the potential utility of the logic-regression-based SNP-SNP interaction analysis of GWAS in providing additional insights to the

False positive discoveries are a major concern in GWAS, in which a large number of SNPs are examined for association with a disease. Any discoveries, including those reported here, have to be validated rigorously in further investigations to exclude false positives arising from population stratification and genotyping errors. The candidate gene approach is also a valid alternative to the data-driven approach of GWAS, whether driven by a functional or biological hypothesis or following up on potential discoveries of GWAS. The application of logic regression is less computationally demanding in candidate-gene studies than in GWAS.

Proper phenotyping is key to increasing the chance of identifying susceptibility genes specific to a clinical phenotype of interest. A recent paper on Crohn's disease

Increasing attention has been paid recently to pathway-based analysis of GWAS
