Multivariate Dimensionality Reduction Approaches to Identify Gene-Gene and Gene-Environment Interactions Underlying Multiple Complex Traits

The elusive but ubiquitous multifactor interactions represent a stumbling block that urgently needs to be removed in searching for determinants involved in human complex diseases. The dimensionality reduction approaches are a promising tool for this task. Many complex diseases exhibit composite syndromes required to be measured in a cluster of clinical traits with varying correlations and/or are inherently longitudinal in nature (changing over time and measured dynamically at multiple time points). A multivariate approach for detecting interactions is thus greatly needed on the purposes of handling a multifaceted phenotype and longitudinal data, as well as improving statistical power for multiple significance testing via a two-stage testing procedure that involves a multivariate analysis for grouped phenotypes followed by univariate analysis for the phenotypes in the significant group(s). In this article, we propose a multivariate extension of generalized multifactor dimensionality reduction (GMDR) based on multivariate generalized linear, multivariate quasi-likelihood and generalized estimating equations models. Simulations and real data analysis for the cohort from the Study of Addiction: Genetics and Environment are performed to investigate the properties and performance of the proposed method, as compared with the univariate method. The results suggest that the proposed multivariate GMDR substantially boosts statistical power.


Introduction
Many complex disorders such as asthma, diabetes, cardiovascular disease, Alzheimer's disease, hypertension, mental disorders, and drug addictions are generally multifaceted phenotypes, measured by a set of scales and/or intermediate phenotypes that can be correlated as the outputs of a metabolic network with many interwoven pathways [1][2][3][4]. Longitudinal data at multiple time points may also be gathered for investigating the trajectory of a phenotype [5]. Furthermore, two-stage multiple testing analysis, i.e., multivariate analysis for grouped phenotypes followed by univariate analysis for the phenotypes in the significant group(s), offers a plausible way to increase statistical power because the traditional Bonferroni correction is usually overly stringent [6,7]. Thus, a multivariate approach is highly demanded for tracking down pleiotropic contributors to complex multifaceted disorders and genetic effects underlying dynamic traits, and for increasing test power [8][9][10][11][12]. In addition, the multivariate strategy can also provide an attractive framework for data integration of several datasets of multi-omics features.
On the other hand, as a natural property of complex networks and the widespread intermolecular dependence in gene regulation and biochemical and metabolic systems, the presence of interac-tions appears to be the norm rather than the exception. Evergrowing evidence has pointed to the view that risk factors act in concert, rather than isolated, to affect complex diseases-the joint actions or interactions of multiple genetic and non-genetic factors play an important role in the etiology of complex diseases [13][14][15][16]. The elusive but ubiquitous interactions present one of greatest challenges for genetic epidemiologists because they make the risk factors underlying complex phenotypes elude traditional single factor-based hunting strategies [14,16,17]. A large amount of research has been devoted to the development of new analytical approaches for the detection and investigation of those interactive determinants involved in complex diseases [18].
The data reduction approaches such as the multifactor dimensionality reduction method (MDR) [19][20][21], the combinatorial partitioning method (CPM) [22], and the restricted partition method (RPM) [23], represent a promising tool for a better identification of simultaneous associations and interactions among multiple risk factors. To circumvent the weaknesses of existing combinatorial approaches [24], we developed a generalized MDR (GMDR) statistical framework that is applicable to diverse phenotypes in population-based and family-based studies and allows adjustment for covariates [25,26]. To date, however, most of the new methods in the literature are univariate, only offering to run separate analysis on one trait at a time. Such a separate analytical strategy does not utilize the information contained in multivariate data. In this article, we propose generalized linear model-, quasi-likelihood model-and generalized estimating equations model-based multivariate combinatorial approaches to detecting gene-gene and gene-environment interactions underlying multiple complex traits. We conduct a series of computer simulations to demonstrate the powerfulness of the proposed methodology. Finally, the proposed methodology is illustrated by an application to a set of real data on nicotine dependence (ND) in the cohort from the Study of Addiction: Genetics and Environment (SAGE).

Ethics Statement
The datasets used for the analyses described in this article were obtained from the database of Genotypes and Phenotypes (dbGaP). All the study protocol and forms/procedures were approved by the Institutional Review Board of the University of Alabama at Birmingham.
In the original MDR method for a case-control study [19], a set of m attributes either genes or discrete environmental factors are chose to span an m-dimensional contingency table. Each subject is allocated into a cell in this m-dimensional space based on the observations on these attributes and every nonempty cell can be labeled as either ''high-risk'' if the ratio of cases to controls in the cell is larger than a pre-specified threshold or ''low-risk'' otherwise. A new dichotomous attribute (i.e., a classification model) is formed by pooling the high-risk cells and the low-risk cells into the highrisk group and the low-risk group, respectively, thus changing the space of the data from originally higher dimensions to one dimension. The resulting model is evaluated in its ability to classify the phenotype; accuracy, defined as the proportion of the correct classifications (i.e., cases in the high-risk group and controls in the low-risk group), is a commonly used measure. Cross-validation and/or permutation testing can be integrated into the above process for evaluation of model, and the optimal subset(s) of features can be selected in terms of the classification ability measured by accuracy or its derivatives such as p-value.
While sharing the same variable construction algorithm as in MDR, GMDR uses a general statistic, instead of affection status, to classify the two divergent groups. The statistic of an individual corresponding to a certain cell in a given contingency table can be generally expressed as the product of its membership coefficient belonging to this cell and its residual under the null hypothesis, which will be elaborated in the following subsections, respectively.

Statistical models and residuals under the null model
Multivariate traits can be represented by an appropriate statistical model corresponding to data nature; generalized linear model (GLM) [27], quasi-likelihood model (QLM) [28] and generalized estimating equations (GEE) model [29] are the commonly used models. All the GLM, QLM and GEE model have the same form of linear predictor and link function.
Assume that a set of explanatory variables influence the outcomes and there is an invertible link function l(m)~g relating the mean to the linear predictor, which can be expressed as, where b is the effect vector probably consisting of b 0 , b t and b c for the intercept(s), the target effects of interest (i.e. gene-gene and/or gene-environment interactions), and the covariate effects, respectively, X is the corresponding design matrix consisting of block matrices I, X t and X c , and I is an identity matrix. The linear predictor can be used for various scenarios. For a repeated measurement study, Y i may have the same parameters b and regressor values x, and thus the design matrix X~x, x, Á Á Á , x ð Þ T . In a clustered design or a longitudinal study, the components of Y may share the same b, but have their own regressors x i s, and thus X~x 1 , x 2 , Á Á Á , x C ð Þ T . For grouped phenotypes in a two-stage multiple testing, each Y i has the component-specific predictor values and parameters including x i and b (i) , and thus resulting in the block effect vector and block incidence matrix, respectively, as follows, b~b (1) b (2) . . .
In application to detection of overall effects and/or pleiotropic effects on multiple traits, the components of Y may share the same regressor x, but have component-specific parameters b (i) .
The residuals can be computed under the null hypothesis of no target effects, i.e., b t~0 , and the estimation methods for different models are summarized as follows.
2.1.1. Multivariate generalized linear model. A GLM is characterized by three parts: the response distribution, the linear predictor, and the link function between the linear predictor and the mean of the response variable. Specifically, the density function of a GLM has such a general form, where h is the location parameter vector, Q is the dispersion or scale parameter vector, a( : ), b( : ), and c( : ) are known functions to specify a member of the exponential family. The expectation and variance of Y are, respectively, and are the first-order and second-order derivatives of b(h), respectively. Denote V (m)~b 00 (h), called variance function, to highlight that b 00 (h) depends on m. The link function l(m)~g relates the mean to the linear predictor in Equation (1). The score of likelihood for a set of independent observations y i (i~1, 2, Á Á Á , N) is, where X i 's are the design matrices of observation i, The GLM can be fitted by maximum likelihood (ML) estimation method. Note that estimation of b does not require knowledge of Q, implying that one may first estimate b and then estimate Q based on the estimateb b. The ML estimator can be derived by setting the score equal to zero and solving the resulting estimating equations. The Newton-Raphson method and the Fisher's scoring method are two well known methods implemented with iteratively weighted (or reweighted) least squares method (IWLS) for finding ML estimates in GLMs when there is no closed form of ML estimates available.
The unknown scale parameter Q can be estimated separately from b after computing residuals usingb b. Although this parameter can, in principle, be estimated by maximum likelihood as well, it is more common to use a ''method of moments'' estimator. Unbiased estimator of Q can be obtained via Pearson's Chisquare statistic as, where p is the number of independent parameters estimated in b (p~0 if b is known). Under the null hypothesis of no target effects (i.e., b t~0 ), we fit b b 0 andb b c as well asQ Q to data. Then, the residuals can be computed by, Multivariate quasi-likelihood model. QLM can be used when only partial information on the data features is available: how the mean is related to the explanatory variables and how the variance of an observation is related to its mean. Compared to GLM, QLM only specifies the link function and the relationship between the first two moments but does not necessarily specify the complete distribution of the response variable. In QLM, quasi-likelihood function is constructed to mimic a proper likelihood function. A quasi-likelihood has the same properties as log-likelihood and the quasi-score function can be formulated by differentiating the quasi-likelihood function. The quasi-score behaves like the score in GLMs. Quasi-likelihood models can be fitted using a straightforward extension of the algorithms used to fit GLMs. The estimating equation for the residuals under the null hypothesis is the same as Equation (2).
2.1.3. Generalized estimating equations model. GEE model is an extension of GLM and QLM [29][30][31]. GEE model requires only specifying a functional form for relationships between the outcome variable and the explanatory factors and between the mean and the variance of the marginal distribution, avoiding the need to model the multivariate distribution and covariance structure for data. Specifically, letting Y be a group of response variables, suppose that (1) there is a link function relating the expectation of Y to a linear predictor in Equation (1), l(m)~g; and (2) the variance is a function of the mean, , w i is the scale parameter, and a i ( : ) and V i ( : )i are some known functions.
Considering a set of data y T~( y T 1 , y T 2 , Á Á Á , y T N ) that is decomposed into N strata and the y i 's are uncorrelated with each other, the estimating equations are formed via a set of K score or quasi-score equations, i is a working variance with a given correlation structure, the j th diagonal element, and R(a) is a working correlation matrix that may depend on some unknown parameter vector a. S i does not need to be equal to the true covariance matrix, although the closer it is to the true one, the better precision the estimates will achieve. The estimates of b can be found by solving U(b)~0. Estimation is typically accomplished through a series of iterations between a modified Fisher's scoring algorithm for b and moment estimation of correlation parameters a and scale parameter Q. Given current estimatesâ a andQ Q of the nuisance parameters, the following modified iterative procedure is for b, The working correlation matrix R i (a) and Q are estimated by the method of moments. Using the current values of parameters calculates the current Pearson residuals defined as, Then, Q can be estimated by, The specific estimator for a depends on the choice of R(a); the general approach is by the function of, After fitting the model under the null hypothesis, the residuals can be computed for several different purposes, e.g., a repeated measurement study, a longitudinal study, a clustered design, and multivariate analysis, wherem m i ,D D i andŜ S i are, respectively, the GEE estimates of the mean, matrices D i and S i under the null hypothesis, b t~0 . When all the components of y i have the same target effect parameter b t as in a repeated measurement study, the residuals can be further averaged over these measurements for a better estimation, where 1 is a vector of which all components are 1 and C is the dimensionality of y i .

Membership coefficient
Membership coefficients that form matrix X t in Equation (1) are used to characterize to which cell(s) a subject can be allocated in the space spanned by a set of target factors. To simplify our presentation, we consider here the samples coming from a homogeneous population. For other cases such as samples from an admixed population and family samples, please refer to our reports elsewhere. The membership coefficient of an unrelated subject is defined as an indicator variable,

Multifactor Dimensionality Reduction Algorithm
The statistic can be defined for various scenarios. In the case where the components of y i have the same target effects b t whether or not the predictor vectors of the components are distinct, the statistic of component j in observation i with respect to cell k in a given contingency table can be computed by (treating as individual ij and cell k), where p ijk is the membership coefficient denoting component j in observation i pertaining to cell k in a given contingency table; all r ij s of observation i (j~1, 2, Á Á Á , C) are the same in a repeated measurement study; and all p ijk s of observation i (j~1, 2, Á Á Á , C) for cell k are the same in a repeated measurement study and probably in a longitudinal study. In application to detecting the overall effects and/or pleiotropic effects of determinants on a group of multivariate phenotypes, the statistic is considered as (treating as individual i and cell jk), where p ik is the membership coefficient denoting individual i belonging to cell k in a given contingency table. In detection of the pleiotropic effects, the aggregation is also suggested to use for each individual [32], For convenience of notation, we use S xy~rx p xy hereinafter to denote the statistic of individual x pertaining to cell y by treating ij as a new individual xo and cell k as y in (6), and treating cell jk as a new cell y and individual i as x in (7).
The statistic defined above reflects a putative association between the phenotype(s) of interest and the target factor(s), offering a possibility for variable construction to create plausible new attributes that maximize the residual phenotypic correlation. When the null hypothesis holds true, the association is expected to be zero and the classification model is formed purely by chance. The variable construction process is illustrated with the t-fold cross-validation procedure although such a cross-validation is not always necessary as other techniques such as permutation testing may determine whether a classification model is beyond chance. The MDR process is described briefly as follows. In Step 1 the data are randomly split into t equal or nearly equal parts for t-fold cross-validation. One subset is used as the testing set and the remained as the independent training set. Then, Steps 2 through 5 are run for the training set to construct a new dichotomous attribute and Step 6 for the testing set to evaluate the fitness of the new attribute(s). In Step   The accuracy is defined by, where TP is True Positive, defined as having a high value in the high-valued group, TN is True Negative, defined as having a low value in the low-valued group, FP is False Positive defined as having a low value in the high-valued group, and FN is False Negative defined as having a high value in the low-valued group. The balanced accuracy [33] is also used,  The statistic is by nature quantitative. However, the above accuracy, measured by the proportion of correct classifications in all classifications, only makes use of the qualitative information in the statistics, i.e., whether or not the statistic value of an individual in the high-valued (low-valued) group is larger (less) than the threshold. As an alternative, the accuracy may also quantitatively be measured by the differences in the averaged statistic values between the high-valued and the low-valued groups [26].

Simulation Study
To verify the validity and demonstrate the high statistical power of the new approach, we performed extensive simulations for continuous phenotypes. To simplify our exposition, here we considered a total of 10 unlinked loci with equi-frequent causative alleles, including two, three, and four disease-causing genes. Hardy-Weinberg equilibrium and linkage equilibrium were assumed in the simulations. Genotype data were simulated for 1,000 individuals with minor allele frequency of 0.3. Bivariate traits were simulated under the digenic espistatic (i.e. two functional loci involved), trigenic (i.e. three functional loci involved), and tetragenic (i.e. four functional loci involved) interaction models commonly used in simulation studies, for example antidiagonal model (i.e., genotypes AAbb, AaBb and aaBB were considered as high-value groups and the rest as a lowvalue group), 3 uppercase-letter model (3ULM) (i.e., genotypes with three uppercase were set as a high-valued groups and the rest as a low-valued group), 4 uppercase-letter model (4ULM) (i.e., genotypes with four uppercase letters were set as high-value groups and the rest as a low-value group), in which the marginal effects of each disease locus were close to zero. For each individual, two continuous traits were simulated conditional on genotype data, interaction model, heritability of each trait and correlation between traits. Phenotypes were generated based on the following multivariate linear model, where y ij denotes phenotype j of individual i, m j is the population mean of phenotype j, g ij is the joint effects of surveyed genes (or target environmental factor when GE interactions are involved) in subject i on phenotype j, and e ij is the residual effects with distribution N(0,s ej 2 ).
Genetic effect g ij is defined as follows, g ij~m gj z ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi (1{p)|s gj 2 =p p if the subject is allocated to the high-valued group where m gj is the mean genetic effect of the phenotype j, s gj 2 is   The statistics of all individuals for GEE-based GMDR (GEE-GMDR) were computed based on equation (8) using R software package geepack [34]. Then the GEE score statistics were input into the GMDR software to identify the best interaction model. For the purpose of comparisons, separate analysis of phenotypes was also performed using the original GMDR method. A threshold of 0 was used to determine whether a cell is highvalued or low-valued in both the methods in the subsequent analysis. We conducted an exhaustive search with 10-fold crossvalidation for all possible one-to nine-locus models in our simulations. The average cross-validation consistency (CVC) and test accuracy (TA) as well as their standard error means (SEMs) were summarized based on 200 simulation replicates. Additionally, Type I error rate and statistical power were also calculated on the basis of the null TA distribution for each scenario under GEE-GMDR and GMDR, respectively. The thresholds for TA at 5% and 1% significance levels were determined through an empirical distribution of TA constructed from permutation testing with 1,000 replicates. We randomly permuted the score statistics to generate pseudo samples for its null distribution in which the potential association between the multilocus genotype and the phenotypes of interest. And then the cross-validation GMDR was performed with the permuted samples, namely, the training data were used to identify the model and the testing data were used to evaluate the model. As the testing samples are independent of the training samples in which the model selection is involved, the TA will randomly fluctuate around 0.5 under the null of no target effects. The TAs from the permutations formed an empirical null distribution. Power and Type I error rate were calculated based on the proportion of the significant models detected in 200 and 1,000 simulations, respectively, with its TA values no less than the cutoff point for each scenario across seven levels of the residual correlation.

A Case Study on Nicotine Dependence
To illustrate use of the GEE-GMDR approach proposed here, a real data set from the Study of Addiction: Genetics and Environment (SAGE) was analyzed to identify interactions among genes. A large proportion of SAGE samples were unrelated except a few siblings. After quality control, a total of 3,897 individuals from three subsamples: the Collaborative Study on the Genetics of Alcoholism (COGA) (1,178 individuals), the Collaborative Study on the Genetics of Nicotine Dependence (COGEND) (1,427 individuals) and the Family Study of Cocaine Dependence (FSCD) (1,292 individuals) were obtained. Using Illumina Human 1M platform, 1,069,796 SNP markers were genotyped for each participant. Self-reported ethnicities indicate that about 35% of the participants are black and 65% are white. Detailed genotype information and demographic characteristics of SAGE cohort can be obtained from the database of Genotypes and Phenotypes (dbGaP) through dbGaP accession number phs000092.v1.p. Three common different measurements of ND were selected from the recorded traits: the lifetime score on FagerstrÖ m Test for Nicotine Dependence (FTND), the DSM4 Nicotine Dependence (DSM4ND) and the largest number of cigarettes smoked in 24 hours (MC).
We excluded SNPs that had missing genotype rate .0.1, minor allele frequency ,0.05 and a Hardy-Weinberg equilibrium test p,10 27 using PLINK software [35]. In total, 744,511 SNP markers were left after quality control. A total of 2,082 individuals were available for the phenotypic traits and also passed the quality control. We also generated a pruned subset of SNPs that are in approximate linkage equilibrium with each other using PLINK software. With the SNP information (dbSNP, Build 135) and the remained SNPs passing the control process, 5 SNP markers in the nicotinic acetylcholine receptor (nAChR) a4 subunit (CHRNA4), 3 in the nAChR b2 subunit (CHRNB2), 56 in the neurotropic tyrosine kinase receptor 2 (NTRK2, also known as the tyrosine kinase receptor gene, TrkB), and 18 in the brain derived neurotropic factor (BDNF) were chosen to detect gene-gene interactions among the four genes. In total, 15,120 (563656618) tetragenic interactions with one SNP from each of the four genes were examined.
Owing to the fact that self-identified ethnicity often partially reflects one's genetic ancestral origins, especially for populations that have complicated migration or admixture histories, the principal components analysis was performed using GCTA software for the SAGE data in which both unrelated samples and relatives are included to identify the population structure [36]. The residual score statistics of GEE-GMDR were computed using

Computer Simulations
The Type I error rates under no residual correlation are presented in Table 1 for the digenic, trigenic and tetragenic interactions of GEE-GMDR and GMDR approach. All the estimates of Type I error were close to the nominal levels. The Type I error rates at the 0.05 significance level were 0.040, 0.054 and 0.048 with the digenic, trigenic and tetragenic interactions for GEE-GMDR method. The Q-Q plot of significance level vs. Type I error rate was well consistent with the theoretical expectation for GEE-GMDR method (Figure 1). The results with different residual correlations were similar (data not shown). The simulation results suggested that the new algorithm had correct Type I error rates, supporting the validity of the proposed procedure. Table 2 displays the means and SEMs of both test accuracy and cross-validation consistency for two simulated continuous traits under no residual correlation. As expected, both bivariate analysis of GEE-GMDR and univariate analysis of GMDR always reach maximum test accuracy and cross-validation consistencies at the particular multilocus models with the same gene number as the simulated models, suggesting that the models with two, three and four functional genes could be correctly identified. Compared with the univariate analysis of GMDR, the GEE-GMDR bivariate analysis had higher test accuracy and cross validation consistency under the correct analytical model, for example, the means of test accuracy and cross validation consistency with GEE-GMDR are 0.665 and 10.000 for digenic model, 0.630 and 9.940 for trigenic model, 0.587 and 8.580 for tetragenic model, respectively, whereas those with GMDR for trait 1 (trait 2) are 0.617 (0.618) and 9.940 (9.975) for digenic model, 0.573 (0.572) and 8.655 (8.585) for trigenic model, 0.528 (0.526) and 5.725 (5.715) for tetragenic model. The results show that the GEE-GMDR method can utilize more information of multiple phenotypes and effectively improve test accuracy and cross-validation consistency. The test accuracy, cross validation consistency, and power (Fig. 2) seemed to be decreased for these three interactions patterns (i.e., digenic, trigenic, tetragenic) and this may be partly due to a lower frequency of the high value group and inflation of sampling error with increasing multilocus genotype given a limited sample size. Figure 2 presents the comparison of the power of the two different analysis approaches to detect gene-gene interactions with pleiotropic effects based on a variety of residual correlation coefficients under digenic, trigenic, and tetragenic models. It indicated that the increase in power for multivariate analysis of GEE-GMDR, relative to univariate analysis of GMDR, depended on the residual correlation for trigenic, tetragenic models and the improvement is dramatic in some situations. GEE-GMDR performs particularly better for negative residual correlation coefficients. This observation is consistent with the previous results of multivariate research [6,[37][38][39]. However, the trend is not so apparent for digenic model. It may mostly be due to the extremely high power of digenic model (both approximate 100%) under our simulation set and the difference in power between two analytical methods cannot be distinguished. In some simulated cases with reduced allele frequency or genetic variance, however, we observed the same trend of power decrease with increase in the residual correlation (data not shown).
In summary, GEE-GMDR method maintains appropriate Type I error rate and thus is a valid test. The comparison of the multivariate analysis of GEE-GMDR with the univariate analysis of GMDR to identify combinations of multiple target genes demonstrates that the GEE-GMDR has higher or at least equal power in most situations. The GEE-GMDR method is able to substantially improve test accuracy, cross-validation consistency and power with inclusion of covariance structure of multiple phenotypes in GEE model. Figure 3 from the principal components analysis shows that the SAGE cohort could be clearly separated into three groups that are roughly consistent with self-reported ethnicity (black, white, and admixed). Although the race and ethnicity are thought to reflect unobserved environmental factors such as diet and family information, the self-reported race and ethnicity are not clearly defined and may not reflect the underlying complexity of them. Thus, it is more appropriate to estimate genetic background from genotype data, such as PCA, and use such information in the analysis [40]. The average correlations among the three phenotypes are 0.52 and they are highly correlated with each other.

Application to Nicotine Dependence Data
Using our new multivariate method and univariate method, 15,120 tetragenic interaction models were examined with one SNP from each gene. The normal distribution of the test accuracy was consistent with our expectation (data not shown). The GEE-GMDR method detected the most significant tetragenic interaction model (rs2072660-rs1209068-rs11030134-rs6011770) with an empirical p value of 2.62e-04 and test accuracy of 0.5780, supporting our hypothesis of a possible interaction among CHRNB2, NTRK2, BDNF, and CHRNA4 underlying ND (Table 3). Detailed information on the SNP of the most significant tetragenic interaction model was given in Table 4. The p value of the identified model using univariate analysis was less significant and the test accuracy was lower. Figure 4 displays high-risk and low-risk distribution for each multilocus genotype combination of the identified tetragenic model. The interaction patterns of high risk and low risk cells varied across each of the different multilocus dimensions, which showed evidence of epistasis, or gene-gene interaction.
Recent evidence revealed genetic associations between ND and CHRNA4 [41,42], NTRK2 and BDNF [43]. Biochemical studies have showed that the a4b2-containing nAChR subtype make up the majority of the high-affinity nicotine-binding sites in our brain and that the expression of genes of both subunits are increased in the case of chronic nicotine exposure. The binding of NTRK2 to BDNF regulates the short-term synaptic functions and long-term potentiation of brain synapses. Biological interactions of BDNF with NTRK2 and CHRNA4 with CHRNB2 have been constructed experimentally under vitro and animal models. Statistical interactions among CHRNA4, CHRNB2, BDNF and NTRK2 underlying ND had been discovered in our previous studies [25,44,45]. Our research results using novel multivariate method indicate that there are potential interactions among these four genes. The finding requires further investigation both through in silico analysis and laboratory verification in future.

Discussion
Identification of epistatic and/or gene-environment interactions underlying complex traits variation is one of the important and challenging tasks in human genetics and epidemiology studies. In the literature, joint actions of multiple factors are the norm rather than the exception in the genetic architecture of complex traits [13][14][15][16]. Compared with traditional methods established by an extension of single-factor-based statistical linear model, recently emerging combinatorial approaches such as MDR, the CPM, the RPM and the GMDR could bridge between statistical interaction and biological mechanism, and take biological plausibility into account. Relative to a wealth of univariate methods available, however, the statistical methods for detecting interactions underlying multiple traits are less well developed for these combinatorial approaches.
Complex diseases can be multidimensional and defined by a set of intermediate phenotypes that cluster together and are correlated. Complex traits can often be grouped into symptom groups. In some cases, data are collected from longitudinal studies. Using a univariate analysis, instead of multivariate analysis, will substantially reduce the power or likely yield a misleading result. It is desirable to to extend GMDR method for multivariate data to analyze complex multifaceted disorders and longitudinal data and/or toincrease test power. Here, we have proposed a natural extension of GMDR to multiple phenotypes based on GLM, QLM and GEE models in terms of the nature of data available, named multivariate GMDR approaches. In GLM, we need to specify all the probability distribution, the relationship between the mean and the variance, and the covariance. In QLM, we only need to specify the relationship between the mean and the variance, and the multivariate correlation structure. GEE model requires only the relationship between the mean and the variance of the marginal distribution. GEE, the extension of GLM and QLM, is more flexible, and has been widely applied to correlated multivariate data because of its robustness and no need for specifying the exact covariance structure [29,30,46]. GEE can yield consistent and efficient estimation of coefficient parameters for the multivariate data, even though the assumed working correlation structure is incorrect and the data are missing completely at random [29]. The GEE-GMDR can therefore handle a variety of multivariate data with different correlation structures, for example, repeated data, longitudinal data, cluster data and multiple phenotypes, under the assumption of estimation of arbitrary subset of the covariance parameters [46]. In conclusion, we propose a comprehensive statistical framework for detection of interactions underlying multivariate phenotypes. Within this framework, various kinds of phenotypes with diverse correlation structures, such as categorical data and continuous data can be analyzed. The original GMDR can be regard as a special case of multivariate GMDR, when the cluster size of multivariate data is equal to one.
One of the key advantages of the proposed method, as shown in both the simulation and the real data analyses of this study, is that this new multivariate approach is more powerful via borrowing information across multiple correlated phenotypes. We have compared the power of GEE-GMDR with the univariate GMDR method through Monte Carol simulations under the digenic, trigenic and tetragenic interaction models. Remarkable increase in power has been observed using our proposed multivariate method in the simulations, especially for trigenic and tetragenic models. The lower power of identifying high order interaction may mostly be attributed to smaller sample size, low allele frequency and low genetic variance [26]. Here, the multivariate method can be an appropriate and promising approach to search for weak associations of high order combination of multiple factors underlying complex traits, which may be missing when each phenotype is considered independently. Consistent with other multivariate techniques, we also observed that the power of GEE-GMDR depends on the size and the structure of residual correlation between phenotypes [6,[37][38][39]. The power to identify a pleiotropic gene-gene interaction is greatly improved in a case with a negative residual correlation. Such a pattern is observed in different interaction models.
The proposed approach has been applied to the SAGE sample with inclusion of multiple phenotypes related to ND. Conventional single-marker methods that separate interacting genes from their context fail to interpret the whole genetic architecture and seem to be inefficient. It is appropriate and powerful to identify high order gene-gene interaction underlying the ND, which is controlled by a large number of genes with a modest effect size, as demonstrated in our study and previous reports [25,44,45]. In human genetic studies, we often collect many correlated phenotypes. Correction for multiple testing is also required in individual analyses for multiple phenotypes. However, the proposed method analyzes all of the traits simultaneously and jointly in a unified statistical model, offering a protected overall significance level.
Very recently a multivariate GMDR method based on GEE was just proposed in the literature [32]. However, our statistic is more general and powerful and that method [32] can be viewed as a special case of our approaches. For data from repeated measurement study, our method is identical to that approach, but will be different for longitudinal data, cluster data and multiple phenotypes. Given the difference of joint genetic effect of genes on multiple phenotypes, the proposed score statistic is more appropriate to model relationships between phenotypes. In addition, our proposed method is evaluated through simulation studies and the real data analysis for quantitative traits; whereas the method in Chio and Park [32] is only assessed by the real data studies for few binary traits. The performance of high-order interactions and the influence of residual correlation on power using GEE-GMDR method are also investigated in our study. Further, we also corrected for population stratification in the real data analysis. In this way our proposed method is a unified GEE-GMDR that can handle a variety of data.
As is the same as the original GMDR, computational limitations for high dimensional interactions still remain with our proposed method, especially for identifying high order interaction for data of the genome wide association studies (GWAS). However, Zhu et al have developed a graphics processing unit (GPU)-based GMDR program (GMDR-GPU) [47], which can handle GWAS data and run more faster than the original GMDR software. Through combination of our GEE-GMDR algorithm with GMDR-GPU program, the problem of computational expense can be alleviated.
GEE statistic used in our GEE-GMDR approach makes use of all dimensions of data but sometimes a part of them will dilute useful information. The possible directions for future research are to explore constructing an overall or extracting few summary statistics by multivariate statistical techniques (e.g., principal components analysis and canonical correlation analysis). Principal components may be useful, in particular, when the phenotypes of interest are highly correlated and/or highly dimensional, i.e., nearly collinear or degenerated multivariate data. The principle is to construct the test statistic based on the first few principal components from a principal components analysis or canonical correlation analysis, instead of all components. Principal components analysis or canonical correlations, as a kind of statistical method of linear transformations, are effective to reduce the number of phenotypes surveyed.