Epistasis, i.e., the interaction of alleles at different loci, is thought to play a central role in the formation and progression of complex diseases. The complexity of disease expression should arise from a complex network of epistatic interactions involving multiple genes.
We develop a general model for testing high-order epistatic interactions for a complex disease in a case-control study. We incorporate the quantitative genetic theory of high-order epistasis into the setting of cases and controls sampled from a natural population. The new model allows the identification and testing of epistasis and its various genetic components.
Simulation studies were used to examine the power and false positive rates of the model under different sampling strategies. The model was used to detect epistasis in a case-control study of inflammatory bowel disease, in which five SNPs at a candidate gene were typed, leading to the identification of a significant three-locus epistasis.
Citation: Wang Z, Liu T, Lin Z, Hegarty J, Koltun WA, Wu R (2010) A General Model for Multilocus Epistatic Interactions in Case-Control Studies. PLoS ONE 5(8): e11384. https://doi.org/10.1371/journal.pone.0011384
Editor: Zvia Agur, Institute for Medical Biomathematics, Israel
Received: January 15, 2010; Accepted: May 31, 2010; Published: August 18, 2010
Copyright: © 2010 Wang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work is supported by National Science Foundation (NSF) grant DMS/NIGMS-0540745 and the Changjiang Scholars Award. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The complexity of biological systems arises from the highly interactive relationships of their components , . Thus, it is likely that the metabolic pathways for a phenotypic trait or disease involve multiple interacting gene products and regulatory loci that could generate a complex network of genetic actions and interactions , . Current genome-wide linkage or association studies have been able to detect genetic actions of individual genes involved in the phenotypic diversity of a complex trait –. Given its ubiquitousness in controlling complex traits and diseases, epistasis resulting from interactions between alleles at different genes has now received increasing attention in genetic studies , . However, many of these studies focus on the identification of low-order pairwise epistasis, leaving epistatic interactions of high orders, their frequency and impact on genetic variation, unexplored.
More recently, Stich et al.  developed a linkage mapping approach to uncover three-way interactions among different quantitative trait loci (QTLs) using a mating design. Beerenwinkel et al.  proposed a mathematical approach for describing multi-way genetic interactions and employing it to study the genetic structure of fitness landscapes for Escherichia coli. Based on the analysis of pathway fragments, Imielinski and Belta  used a genome-scale knockout design to detect high-order epistatic relationships between components of large metabolic networks. Hansen and Wagner  showed that higher-order genetic interactions are potentially important if the total genomic mutation rate is large and the interaction density among loci is not too low. With the widespread availability of high-throughpout genotyping technology, there is a pressing need to estimate higher-order epistasis involving any number of genes and assess the role of epistasis in the creation and maintenance of genetic variation for complex traits.
The motivation of this study is to develop a general model for estimating epistasis of any order with multilocus single nucleotide polymorphism (SNP) data in case-control studies. In particular, the model allows the estimation and testing of high-order epistasis. Because of its easy sample collection, a population-based case-control design has been widely used in candidate gene or genome-wide association studies –. By comparing genotype frequencies for a gene in unrelated individuals with the disease and healthy controls, this design has power to test the significance of the association between the gene and disease. However, only a few studies used a case-control design to characterize epistasis  and, also, the epistasis they defined on the basis of logistic regression models presents a computational complexity. The new model described in this article has, for the first time, embedded quantitative genetic principles into a chi-square test framework, allowing the dissection of overall multilocus genetic effects into various components including epistatic interactions of high orders. The model was validated through simulation studies and a real data analysis.
Large Quantitative Genetic Models for Epistasis
Epistasis was originally defined as the expression of an allele at one locus masked by an allele at another locus . This concept was then explained in a statistical manner by Fisher  as the deviation of genetic action from additivity in a linear model. Fisher's definition allows epistasis to be quantified in different forms based on its biological meaning determined by Bateson . For a two-locus epistasis, all possible forms of epistasis include the interactions between additive effects at the two loci, additive effect at the first locus and dominant effect at the second locus, dominant effect at the first locus and additive effect at the second locus, and dominant effects at the two loci. Each of these epistatic forms contributes differently to the overall genetic value of a two-locus genotype. We used Mather and Jinks' formulation  to partition a genotypic value into its different components including epistasis.
Suppose there are two loci, A with two alleles and and B with two alleles and , which form nine two-locus genotypes. Let denote the genetic value of an arbitrary genotype ( for genotypes , , and ; for genotypes , , and , respectively). We dissect into different components as table 1.
Where is the overall mean, and are the additive effect at genes A and B, and are the dominant effect at genes A and B, respectively, and , , , and are the additive additive, additive dominant, dominant additive, and dominant dominant epistatic interactions between the two genes, respectively.
Adding a locus, C with two alleles and , to the two-gene model generates 27 three-locus genotypes, expressed as ( for genotypes , , and , respectively). A three-genotypic value () is dissected into the following components:
- the overall mean ;
- the main genetic effects including the three additive effects (, , and ) at genes A, B, and C, and the three dominant effects (, , and ) at genes A, B, and C;
- the two-way interaction effects including the additive additive (), additive dominant (), dominant additive (), and dominant dominant () epistasis between genes A and B, the additive additive (), additive dominant (), dominant additive (), and dominant dominant () epistasis between genes A and C, and additive additive (), additive dominant (), dominant additive (), and dominant dominant () epistasis between genes B and C;
- the three-way interaction effects including the additive additive additive (), additive additive dominant (), additive dominant additive (), dominant additive additive (), additive dominant dominant (), dominant additive dominant (), dominant additive dominant (), and dominant dominant dominant () epistasis among genes A, B, and C.
We propose a general model for describing genetic components for a genotype composed of any number of loci. Consider loci which form genotypes. The value of a -locus genotype is composed of the overall mean, the additive and dominant effects for each locus, and epistasis of different kinds and orders among these loci. Let the space of the genetic effects at individual loci be defined as for gene 1, for gene 2, …, for gene . Thus, we can define all possible genetic effects () as
- If , then ;
- If , then ;
- If , then ;
- If , , …,, then .
- If , , …,, then .
Equation (8) gives a general form for main and interaction genetic effects among an arbitrary number of loci. Mathematical algorithms for solving epistatic equations are given in Text S1.
Based on the definitions, we now provide a procedure for testing epistasis of different kinds and orders with multilocus genetic data. Consider a case-control study in which cases (there is a disease) and controls (there is no disease) are selected randomly from a natural population. Case and control groups are matched for demographical factors such as age, race, gender, life style, and body mass. All subjects from the case and control groups are genotyped genome-wide or for particular chromosomal regions of interest, depending on the purpose of the study. Let and denote the observations of a general genotype () derived from three markers A, B, and C. Based on Mather and Jinks' partition of genotypic values , we calculate genetic effect parameters from genotypic values using equation (4). For both cases and controls, the genotypic values used to calculate each effect parameter are dissolved into two groups, plus and minus, which forms a 2 (cases and control)2 (plus and minus) contingency table. For example, the contingency table for testing the additiveadditiveadditive epistatic effect is expressed as table 2.
From the table, the test statistic is calculated and compared with the critical threshold with one degree of freedom. We proved that the test statistics under the null hypothesis calculated from the above contingency table follows a distribution with less than one degree of freedom .
The contingency tables for testing the other parameters can be made similarly. For a particular group ( = 1 for cases, 2 for controls), the genotypic values used to calculate the three-way epistatic effect parameters are tabulated as table 3.
The thresholds for testing each of these three-locus epistases are derived, which are = 3.84, 3.20, 3.20, 3.20, 2.60, 2.60, 2.60, and 2.14, respectively. The genotypic values used to calculate the two-way epistatic effect parameters are tabulated as table 4.
The thresholds for testing each of these two-locus epistases are derived, which are = 3.84, 3.84, 3.84, 2.50, 2.50, 2.50, 3.20, 3.20, 3.20, 3.20, 3.20 and 3.20, respectively. The genotypic values used to calculate the main genetic effect parameters are tabulated as table 5.
The thresholds for testing each of these two-locus epistases are derived, which are = 3.84, 3.84, 3.84, 2.60, 2.60 and 2.60, respectively. For an arbitrary number of markers, the genotypic values used to calculate the main and epistatic (of different orders) genetic effect parameters can be similarly divided into plus and minus groups, from which the test statistics are calculated.
The model was used to analyze a case-control study aimed to detect genetic variants for inflammatory bowel disease (IBD) with candidate gene approaches . As a member of the membrane associated guanylate kinase family, TDiscs large homolog (DLG5) plays a central role in maintaining cell junctions and cell shape and in clustering channel proteins at the cell surface . Five single nucleotide polymorphisms (SNPs), Arg30Gln, Glu514Gln, Pro979Leu, Gly1066Gly, and Pro1371Gln, genotyped at DLG5 for both cases and controls are hoped to be associated with IBD. The cases include 115 sporadic IBD patients, aged from 22 to 66 years old, from the Milton S Hershey Medical Center, whereas the controls are 172 unrelated healthy individuals, aged from 15 to 81 years, from the Milton S Hershey Medical Center and Philadelphia gift of Life Donor Program. All the human tissues used for pathological studies and genetic analysis were approved by the Human Subjects Protection Offices of The Pennsylvania State University College of Medicine, and were undertaken with the understanding and written consent of each subject.
Because of a modest sample size used, our analysis will focus on a three-SNP analysis, although the model can deal with any number of SNPs. None of the five SNPs displays an additive genetic effect, but Arg30Gln, Pro979Leu, and Gly1066Gly were each found to trigger a significant dominant effect on the disease () (table 6). There are 10 possible pairs for the five SNPs, with each pair subject to a two-locus epistatic analysis. The number and distribution of two-locus epistasis are given in table 7. It is interesting to see that significant two-locus epistasis was observed only between Arg30Gln and other SNPs including Pro979Leu with a significant main dominant effect and two non-significant SNPs (Glu514Gln and Pro1371Gln). The form of significant epistasis is limited to the interactions between the dominant effect at Arg30Gln and the additive/dominant effects at the other SNPs.
The five SNPs produce 10 three-locus combinations which were analyzed by a three-locus epistasis model. Each combination has eight forms of three-SNP epistasis. Table 8 lists the test statistics for all possible combinations and forms of epistasis, with significant epistasis highlighted in boldface. The interactions among the additive effects at any three of the five SNPs were not significant; the same was also observed for the three-way dominant interactions. The significant three-locus epistasis must include both the additive and dominant effect at three SNPs. In general, Arg30Gln have more significant three-locus interactions and display higher three-locus significance level than the other SNPs. Arg30Ln, Glu514Gln, and Pro979Leu produce the most numerous forms of epistasis (3), followed by the combinations of Arg30Ln, Gly1066Gly, and Pro1371Gln (2), Glu514Gln, Gly1066Gly, and Pro1371Gln (2), Arg30Ln, Glu514, and Pro1371Gln (1), Arg30Ln, Pro979Leu, and Pro1371Gln (1), Pro979Leu, Gly1066Gly, and Pro1371Gln (1). The three SNPs with significant main effects (Arg30Ln, Pro979Leu, and Pro1371Gln) do not produce a significant three-locus epistatic interaction. The two SNPs displaying non-significant main effects (Glu514Gln and Pro1371Gln) could generate significant three-locus interactions with SNPs Arg30Ln and Gly1066Gly but not with Pro979Leu (table 8).
After significant high-order epistasis is detected, the next step is to make a biological interpretation of such epistasis. To interpret it, we will use the dominant ()additive ()additive () epistasis among Arg30Ln, Glu514Gln, and Pro979Leu as an example. Table 9 gives the structure of genetic effects for each three-locus genotypic value in terms of the additive, dominant, and epistatic effects of different orders. The epistasis only contributes to the genotypic value of , , , and (table 9). For each of these four genotypes, their values are partitioned into different effect components for both cases and controls (table 10). As can be seen, the epistasis increases, by 9 cases, the incidence of IBD for those with genotype or , but decreases the IBD incidence of those carrying genotype or with the same extent.
Simulation studies were undertaken to examine the statistical behavior of the new model. We will focus on the investigation of the power and false positive rates (FDR) for the detection of three-locus epistasis. Three different simulation schemes will be used with varying numbers of cases vs. controls, 200 vs. 200, 400 vs. 400, and 1000 vs. 1000. The eight possible forms of three-locus epistasis can be sorted into four presentative ones, (1) additiveadditiveadditive (no dominant effect), (2) additiveadditivedominant, additivedominantadditive, and dominantadditiveadditive (no one dominant effect), (3) additivedominantdominant, dominantadditivedominant, and dominantdominantadditive (two dominant effects), and (4) dominantdominantdominant (three dominant effects).
For a real data set, different SNPs may be associated or independent of each other. We will investigate how SNP-SNP associations affect the behavior of the new model. In one data set, three SNPs with the same allele frequency were simulated with pair-wise and three-locus linkage disequilibria. Among the three SNPs, only additiveadditiveadditive, additiveadditivedominant, additivedominantdominant, and dominantdominantdominant were assumed to exist. This can be done by simulating a contingency table with constraints , , , and the test statistics for the other effects the corresponding thresholds. The same parameters, except that there is no linkage disequilibrium, were used to simulate the second data set containing three SNPs.
Table 11, table 12, and table 13 give the power and false positive error rates (FPR) of the three-locus interaction detection by the new epistatic models. The power to detect the three-locus epistasis increase remarkably with sample size in a case-control study. With sample sizes of 200 vs. 200, there is power of about 0.51–0.61, with the additiveadditiveadditive epistasis detected most easily, followed by the additiveadditivedominant epistasis, the additivedominantdominant epistasis, and the dominantdominantdominant epistasis. When sample sizes increase to 400 vs. 400, the power for the three-locus epistasis detection will surpass three quarters. If sample sizes 1000 vs. 1000 are used, the power reaches 0.99 or more. In general, whether the SNPs are associated or independent does not affect the power substantially, although in some cases the power is higher for associated SNPs than independent SNPs.
The power displays a small FPR (tables 11, 12, and 13). Even if small sample sizes 200 vs. 200 are used, there is still a small chance that the model provides a false positive result for the three-locus epistasis detection. The FPR was found to be consistent, regardless of sample sizes and the degree of SNP-SNP associations.
The phenotypic variation of a trait or disease is highly complex given its polygenic inheritance and environmental influence. Most original quantitative genetic models generally assume that allelic effects are additive, with the size linearly proportional to the number of alleles. These models are modified by considering that there are genetic interactions between different alleles at the same locus (dominance). It is now recognized that the interactions between different loci (epistasis) within gene networks may play an important role , . More recent evidence shows that high-order epistasis among more than two genes may form a crucial component in genetic interaction networks , , , . In fact, quantitative genetic analyses have detected high-order epistatic effects in plants. For example, high-order epistasis could be correlated with the aggressiveness of the isolate of Phytophthora capsici through influencing double crosses among different loci at meiosis . Wu  used a mating design with clonal replicates to identify the significant contribution of high-order epistasis to genetic variation in stem wood growth traits in poplars.
An increasing availability of high-throughput SNP data has led to the development of various statistical approaches for effectively analyzing epistasis among multiple polymorphisms, including logistic regression, multifactor dimensionality reduction (MDR), Bayesian analysis, and machine learning , , –. In this article, we developed a general model for detecting the episatsis of any order in case-control genetic association studies by integrating traditional quantitative genetic principles. Despite the existence, high-order epistasis may be obscured by metabolic network redundancy . The integration of quantitative genetic principles makes our approach capable to identify high-order epistatic interactions with genetic relevance. The model was tested by simulation studies. It displays adequate power for the detection of high-order epistasis with a modest sample size; for example, 400 cases vs. 400 controls. When sample sizes of cases and controls increase to 1000 vs. 1000, which is currently not a problem for most genetic association studies, the model has almost full power to detect three-locus epistasis of different forms. Even if a small size of samples (say 200 vs. 200), the new model has a low false positive rate for epistatic detection. The practical application of the model is validated by analyzing a real data set for the genetic study of inflammatory bowel disease. The model detected significant three-locus epistatic interactions among different SNPs genotyped from a candidate gene DLG5 .
Our model allows the characterization of epistasis of any order. Its implementation into a practical setting of genome-wide association studies is challenged by an exponentially increasing number of SNP-SNP combinations. To make this tractable, one may incorporate optimization techniques into our model, allowing the selection of the most important combinations. An additional issue is to determine the critical threshold with multiple correlated SNPs in genome-wide association studies. An empirical approach for determining a genome-wide threshold is to employ non-parametric permutation testing (see ref. , , –). Lastly, the model is developed to detect multilocus epistasis at the SNP level, but given recent discoveries for the importance of haplotypes in trait control –, the model should be extended to consider high-order interactions expressed by different haplotypes. In the current model specification, we choose controls that are matched for cases in terms of biological, environmental, or demographical factors. When such matches are not possible, we need to embed these factors as covariates into the model, in which the interactions between genes and these factors can be tested. Third, the model can be extended with multiple diseases to consider the pleiotropic effect of a gene. The results about high-order epistasis detection using the this and extended models could be used for iterative model building and functional annotation of genes. Future applications of these results includes analysis of the metabolic networks of pathogenic organisms and generation of epistatic candidate models for genome-wide association studies.
Conceived and designed the experiments: RW. Performed the experiments: ZL JPH. Analyzed the data: ZW TL. Contributed reagents/materials/analysis tools: ZW TL ZL JPH. Wrote the paper: RW.
- 1. Weng G, Bhalla U, Iyengar R (1999) Complexity in biological signaling systems. Science 284: 92–96.
- 2. Hlavacek W, Faeder J (2009) The Complexity of Cell Signaling and the Need for a New Mechanics. Science's STKE 2:
- 3. Huang L, Sternberg P (1995) Genetic dissection of developmental pathways. Methods in cell biology 48: 97–122.
- 4. McMullen M, Byrne P, Snook M, Wiseman B, Lee E, et al. (1998) Quantitative trait loci and metabolic pathways. Proceedings of the National Academy of Sciences of the United States of America 95: 1996.
- 5. Ritchie M, Hahn L, Roodi N, Bailey L, Dupont W, et al. (2001) Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics 69: 138–147.
- 6. Martin M, Gao X, Lee J, Nelson G, Detels R, et al. (2002) Epistatic interaction between KIR3DS1 and HLA-B delays the progression to AIDS. Nature genetics 31: 429–434.
- 7. Gabutero E, Moore C, Mallal S, Stewart G, Williamson P (2007) Interaction between allelic variation in IL12B and CCR5 affects the development of AIDS: IL12B/CCR5 interaction and HIV/AIDS. AIDS 21: 65.
- 8. Hirschhorn J, Daly M (2005) Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6: 95–108.
- 9. Marchini J, Donnelly P, Cardon L (2005) Genome-wide strategies for detecting multiple loci that influence complex diseases. Nature genetics 37: 413–417.
- 10. Wang W, Barratt B, Clayton D, Todd J (2005) Genome-wide association studies: theoretical and practical concerns. Nature Reviews Genetics 6: 109–118.
- 11. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira M, et al. (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics 81: 559–575.
- 12. Wan X, Yang C, Yang Q, Xue H, Tang N, et al. (2010) Predictive rule inference for epistatic interaction detection in genome-wide association studies. Bioinformatics 26: 30.
- 13. Phillips P (2008) Epistasis-the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9: 855–867.
- 14. Moore J, Williams S (2009) Epistasis and its implications for personal genetics. The American Journal of Human Genetics 85: 309–320.
- 15. Stich B, Yu J, Melchinger A, Piepho H, Utz H, et al. (2007) Power to detect higher-order epistatic interactions in a metabolic pathway using a new mapping strategy. Genetics 176: 563.
- 16. Beerenwinkel N, Pachter L, Sturmfels B, Elena S, Lenski R (2007) Analysis of epistatic interactions and fitness landscapes using a new geometric approach. BMC Evolutionary Biology 7: 60.
- 17. Imielinski M, Belta C (2008) Exploiting the pathway structure of metabolism to reveal high-order epistasis. BMC Systems Biology 2: 40.
- 18. Hansen T, Wagner G (2001) Epistasis and the mutation load: a measurement-theoretical approach. Genetics 158: 477.
- 19. Zhang Y, Liu J (2007) Bayesian inference of epistatic interactions in case-control studies. Nature genetics 39: 1167–1173.
- 20. Nunkesser R, Bernholt T, Schwender H, Ickstadt K, Wegener I (2007) Detecting high-order interactions of single nucleotide polymorphisms using genetic programming. Bioinformatics 23: 3280.
- 21. Gayán J, González-Pérez A, Bermudo F, Sáez M, Royo J, et al. (2008) A method for detecting epistasis in genome-wide studies using case-control multi-locus association analysis. BMC genomics 9: 360.
- 22. Bateson W (1910) Mendels principles of heredity. Molecular and General Genetics MGG 3: 108–109.
- 23. Kempthorne O (1968) The correlation between relatives on the supposition of mendelian inheritance. American Journal of Human Genetics 20: 402.
- 24. Workman P (1973) Biometrical genetics. The study of continuous variation. American Journal of Human Genetics 25: 461.
- 25. Liu T, Thalamuthu A, Liu J, Chen C, Wu R (2010) A model for testing epistatic interactions of complex diseases in case-control studies. Biostatistics. (in press).
- 26. Lin Z, Poritz L, Franke A, Li T, Ruether A, et al. (2009) Genetic association of DLG5 R30Q with familial and sporadic inflammatory bowel disease in men. Disease markers 27: 193–201.
- 27. Stoll M, Corneliussen B, Costello C, Waetzig G, Mellgard B, et al. (2004) Genetic variation in DLG5 is associated with inflammatory bowel disease. Nature genetics 36: 476–480.
- 28. Bartual R, Lacasa A, Marsal J, Tello J (1993) Epistasis in the resistance of pepper to phytophthora stem blight (Phytophthora capsici L.) and its significance in the prediction of double cross performances. Euphytica 72: 149–152.
- 29. Wu R (1996) Detecting epistatic genetic variance with a clonally replicated design: models for lowvs high-order nonallelic interaction. TAG Theoretical and Applied Genetics 93: 102–109.
- 30. Liang Y, Kelemen A (2008) Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases. Statistics Surveys 2: 43–60.
- 31. Kayano M, Takigawa I, Shiga M, Tsuda K, Mamitsuka H (2009) Efficiently finding genome-wide three-way gene interactions from transcript- and genotype-data. Bioinformatics 25: 2735.
- 32. Jiang R, Tang W, Wu X, Fu W (2009) A random forest approach to the detection of epistatic interactions in case-control studies. BMC bioinformatics 10: S65.
- 33. Carlborg O, Andersson L (2002) Use of randomization testing to detect multiple epistatic QTLs. Genetics Research 79: 175–184.
- 34. Alison M (2009) The effect of alternative permutation testing strategies on the performance of multifactor dimensionality reduction. BMC Research Notes 1:
- 35. Edwards T, Turner S, Torstenson E, Dudek S, Martin E, et al. (2010) A General Framework for Formal Tests of Interaction after Exhaustive Search Methods with Applications to MDR and MDR-PDT.
- 36. Judson R, Stephens J, Windemuth A (2000) The predictive power of haplotypes in clinical response. pgs 1: 15–26.
- 37. Bader J (2001) The relative power of SNPs and haplotype as genetic markers for association tests. pgs 2: 11–24.
- 38. Liu T, Johnson J, Casella G, Wu R (2004) Sequencing complex diseases with HapMap. Genetics 168: 503.
- 39. Rha S, Jeung H, Choi Y, Yang W, Yoo J, et al. (2007) An association between RRM1 haplotype and gemcitabine-induced neutropenia in breast cancer patients. The Oncologist 12: 622.