Using Information Interaction to Discover Epistatic Effects in Complex Diseases

It is widely agreed that complex diseases are typically caused by the joint effects of multiple instead of a single genetic variation. These genetic variations may show stronger effects when considered together than when considered individually, a phenomenon known as epistasis or multilocus interaction. In this work, we explore the applicability of information interaction to discover pairwise epistatic effects related with complex diseases. We start by showing that traditional approaches such as classification methods or greedy feature selection methods (such as the Fleuret method) do not perform well on this problem. We then compare our information interaction method with BEAM and SNPHarvester in artificial datasets simulating epistatic interactions and show that our method is more powerful to detect pairwise epistatic interactions than its competitors. We show results of the application of information interaction method to the WTCCC breast cancer dataset. Our results are validated using permutation tests. We were able to find 89 statistically significant pairwise interactions with a p-value lower than . Even though many recent algorithms have been designed to find epistasis with low marginals, we observed that all (except one) of the SNPs involved in statistically significant interactions have moderate or high marginals. We also report that the interactions found in this work were not present in gene-gene interaction network STRING.


Introduction
The availability of ever more extensive genetic information has spurred intense research on the search for the genetic factors that influence common complex traits. Genome Wide Association Studies (GWAS) aim at discovering associations between genetic factors and complex traits such as diseases. In GWAS, hundreds of thousands of Single Nucleotide Polymorphisms (SNPs) are analyzed to determine whether they are associated with the disease or conditions of interest. Due to limitations on the data, these analyses are usually performed using single SNP statistical tests and correcting for multiple testing.
This approach has severe limitations since epistatic interactions of SNPs are very important in determining susceptibility to complex diseases. Existing methods for SNP interaction discovery perform poorly when marginal effects of disease loci are weak or absent. As an example of a case where this may happen, it has been suggested that many genes with small effects rather than few genes with strong effects contribute to the development of asthma. The problem is that the individual effects of the interacting SNPs may be too small to be detected with the most commonly used statistical methods. Therefore, there is a need for more powerful methods that are able to identify interactions between SNPs with low marginal effects.
A number of different methods have been used to find epistatic interactions, including statistical methods (e.g. ATOM [1]), search methods (e.g. BEAM [2] or SnpHarvester [3]), regression methods (e.g. Lasso Penalized Logistic Regression [4]) and machine learning methods (e.g. [5], MegaSNPHunter [6] or decision tree based methods [7]). Classic methods such as Logistic Regression have been pointed as appropriate methods to consistently estimate the strength of association between a predictor and disease [8] [9]. An additional advantage of regression methods is the ability to deal with population structure by including principle components as covariates [10]. However it has also been stated that logistic regression has limited power for modelling high-order non-linear interactions that are likely important in the etiology of complex diseases [11]. Multivariate regression is not suited for problems with hundreds of thousands of variables. Intermediate strategies that permits fast computation while preserving the spirit of multivariate regression have been explored [4] [12]. The lasso penalty is an effective device for continuous model selection, especially in problems where the number of variables far exceeds the number of observations. The problem is that even these intermediate regression strategies are not prepared for dealing with the interactions between a large number of variables. That is why these methods are usually combined with filtering strategies such as selecting SNPs with high marginals and building the regression model considering only the interactions between these high-marginal SNPs [4]. However in this paper we chose not to use any filtering strategy in order not to loose any potentially important SNP on this step.
In this paper, we describe results obtained using a measure known as information interaction, that was used to identify pairs of SNPs that, together, provide a significantly higher risk of disease. Information interaction expresses the amount of information (reduction of entropy) provided by a set of variables, beyond that which is present in any subset of those variables. Using a computationally efficient approach, we compute the pair of SNPs with the highest information interaction, relatively to the phenotype of interest. To control for multiple hypothesis testing, the significance of the relevant results is validated using a permutation testing procedure. Information interaction has already been applied to epistasis detection [13] [14], however, those applications involved the use of a filtering step that may remove important SNPs. We show in our paper that it is possible to apply information interaction to the WTCCC breast cancer dataset without any filtering step.
The results obtained in artificial datasets show that the approach vastly outperforms existing methods such as SnpHarvester [3], and Beam [2]. Results obtained in the Wellcome Trust Breast Cancer dataset [15] have shown that the approach is applicable to the large amounts of data used in GWAS and that previously unknown, statistically significant, interactions can be discovered. We show also that the interactions that were statistically significant are composed of SNPs with moderate or large marginals.

Materials and Methods
To test methods to find interactions with low marginals, we need to have a dataset that simulates these type of interactions. We selected simulated datasets that were tested with the SNPHarvester algorithm [3]. We used datasets in which there are multiple disease loci without marginal effects. Our argument is that if our method can detect these interactions in the artificial datasets, then it will be able to find similar types of interactions in real datasets such as the Wellcome Trust Breast Cancer Dataset.

Artificial datasets
60 different epistatic models were the base for the generation of datasets in which disease loci do not have marginal effects. These epistatic models, firstly used in [16] use different parameter values for parameters such as heritability (h 2 ) and minor allele frequency (MAF). Heritability ranges from 0.025 to 0.4 and MAF ranges from 0.2 to 0.4. 100 datasets were generated for each disease model, each one with 200 cases, 200 controls and 1000 SNPs. In these datasets there is only a pair of interacting SNPs in positions 1 and 10 from left to right.
The datasets in which there are multiple disease loci without marginal effects try to simulate the expectation that there might exist multiple SNP-SNP interactions in the association studies. Eight hybrid models were used for the generation of the datasets. Since these datasets with multiple loci are closer to what we may expect in reality, we will report the results of our methods in these datasets generated with hybrid models.

Wellcome Trust Case Control Consortium breast cancer dataset
The artificial datasets are very important to test the methods and estimate their power. If we can develop a method that successfully detects the associations in the artificial datasets, then we can apply that method to real world datasets in order to detect associations with similar characteristics. The Wellcome Trust Case Control Consortium (WTCCC) breast cancer dataset is a large real world dataset with 1045 cases, 1476 controls and 15436 SNPs. We used only the SNPs that were also used in the WTCCC original publication [15] after data cleaning. Due to efficiency reasons it was also necessary to convert genotypes into binary variables. Each SNP is therefore coded with 2 bits which is enough since we use 3 values for the genotypes and 1 value for missing data. This encoding of the SNPs is not the same that was used in other works such as [13] or [14]. The reason for this is that the source code of the tool uses bitwise operations that make the execution to run much faster. As we will see in the results in the artificial datasets, our method, using this encoding, overcomes state-of-the-art methods BEAM and SNPHarvester in terms of power to detect pairwise epistatic associations.

Classification methods
Our first approach was to select for study one artificial dataset that contains multiple disease loci, without marginal effects, and apply classification methods to see if it was possible to explain the phenotype based on the 1000 SNPs (10 of which are the disease loci).
We used WEKA to test several classification methods such as Alternating Decision Trees, Voted Perceptrons and Support Vector Machines. We also tested SMLR method (Sparse Multinomial Logistic Regression).
Alternating Decision (AD) Trees [17] are a generalization of the well known decision tree learning algorithm. The AD Tree learning algorithm is based on boosting. Boosting is a machinelearning meta-algorithm that tries to learn a strong classifier, based on weak classifiers. The models produced by AD Tree learning are relatively easy to interpret, especially if compared with other boosting procedures. AD Trees were originally proposed for binary classification, but posterior work extended it to work for multiple classes [18].
Voted perceptrons [19] are an evolution of the perceptron algorithm which is an Artificial Neural Network with only one neuron. Voted perceptrons store the list of all prediction vectors generated after each mistake. For each vector the method keeps the number of iterations it survives until the next mistake is made. This number is called the weight of the vector. To calculate a prediction, the binary prediction of each one of the vectors is computed and all predictions are combined by a weighted majority vote. Support Vector Machines (SVMs) [20] are the basis of classification method based on creating a maximum-margin hyperplane in a transformed input space. The transformation of the input space is done implicitly by means of a kernel function. The parameters of the solution hyperplane are derived from a quadratic programming optimization problem using algorithms such as Sequential Minimal Optimization (SMO) [21].
Sparse Multinomial Logistic Regression (SMLR) [22] algorithm learns a multi-class classifier and simultaneously performs feature selection to identify a small subset of features relevant to the decision. The learned classifier reports the probability of a sample belonging to each of the m classes given m sets of feature weights, one for each class.

Fleuret method
Fleuret published a fast binary feature selection technique (that we call the Fleuret method) based on Conditional Mutual Information Maximization (CMIM) [23] that we wanted to apply to our problem of detecting interacting SNPs with low marginals. This method is based on picking features that maximize their mutual information with the class to predict, conditional to any feature already picked. According to the author, this method selects features which are both individually informative and twoby-two weakly dependant.
The goal of this method is to select a small subset of features that carries as much information as possible. To measure the amount of information carried by the features, the Fleuret method uses Conditional Mutual Information. This information theory measure is based on another very important measure which is the Entropy H(U) of a random variable, which quantifies the uncertainty of U. The conditional entropy H(UDV ) quantifies the remaining uncertainty of U, when V is known. If U is a deterministic function of V then H(UDV )~0. On the other hand if U and V are independent, knowing V does not tell anything about U and H(UDV )~H(U). The Conditional Mutual Information is given by Equation 1: This value can be seen as the difference between the average remaining uncertainty of U when W is known and the same uncertainty when both W and V are known. If V and W carry the same information about U, the conditional mutual information is zero. On the other hand if V brings information about U that is not already contained in W , I(U; V DW ) is different from zero.
The standard implementation of the algorithm is done by keeping a score vector s which contains for every feature X n , after the choice of feature v(k), the score s½n~min lƒkÎ I(Y ; X n DX v(l) ). The score table is initialized withÎ I(Y ; X n ) and the algorithm picks at each iteration the feature v(k) with the highest score. Every score s½n is then refreshed by taking the minimum of s½n and I I(Y ; X n DX v(k) ). Fleuret proposed a faster implementation which is based on the fact that the score vector can only decrease when the process goes on and bad scores may not need to be refreshed. This faster implementation produces the same results as the standard implementation described.

Information interaction pairwise search method
The application of a systematic search over all possible pairs of SNPs came as a natural idea after the bad performance of the Fleuret method. If we are not able to have any type of information to guide the choice of our first disease locus, that means we need to consider all the pairs of SNPs, to decide which ones may be interacting in a way that help us to explain the phenotype.
The choice of information interaction as the metric to evaluate if a pair of SNPs is associated with the phenotype arose as a natural option. We define that a pair of SNPs If the information that SNP A provides about class Y is higher if we know SNP B than it is if we do not know SNP B, then this additional information is the interaction information or synergy between the two variables A and B with respect to class Y .
In our experiments in Breast Cancer dataset, the user-defined threshold T is defined by running 1000 permutation tests and selecting the highest interaction information value found. This way, all our selected interactions will have a p-value greater than 0.001.

Mutual information search method
We also applied mutual information to all SNPs in order to find single SNP associations to the phenotype and be able to compare the results with information interaction method.
Similarly to the information interaction method, a SNP A is associated with the phenotype with a user-defined threshold T if and only if I(A; Y )wT. I(A; Y ) is the mutual information and expresses the amount of information that is shared between A and Y (Equation 3).
In our experiments in Breast Cancer dataset, the user-defined threshold T is defined by running 1000 permutation tests and selecting the highest mutual information value found. This way, all our selected single SNP associations will have a p-value greater than 0.001.

Permutation testing
Permutation testing is a non-parametric procedure for determining statistical significance based on rearrangements of the labels of a dataset. It is a robust method but it can be computationally intensive. A test statistic, which is computed from the dataset, is compared with the distribution of permutation values. These permutation values are computed similarly to the test statistic, but under a random rearrangement of the labels of the dataset. Permutation tests can help reduce the multiple testing burden [24] and can be used to compare statistical tests [1].
In bioinformatics, permutation tests have become a widely used technique. The reason for this popularity has to do with its nonparametric nature, since in many bioinformatics applications there is no solid evidence or sufficient data to assume a particular model for the obtained measurements of the biological events under investigation [25].

Application of classification methods to artificial datasets
The application of the classification methods to our selected artificial dataset did not produce good results. In fact, as we can see from Table 1 none of the methods achieved a 10-fold cross validation accuracy above 56.5%.
We also performed experiments using feature selection methods in WEKA. The 10-fold cross validation accuracies improved, but none of the variables selected by the feature selection methods corresponded to the disease loci. We tried several feature selection methods available on WEKA and none was able to identify some of the 10 disease loci.
A question that arised was how the performance of classifiers behaves when we add more noisy variables. To study that we performed a study in which we started by training a classifier with the 10 disease loci. We then add one noisy variable at each step and retrain the algorithm using the same training parameters. Even taking into account that the training parameters are not adjusted when we add more variables, it is interesting to see how the 10-fold cross validation accuracy rapidly degrades with the addition of noisy variables (see Figure 1).
We conclude that classification methods are not capable of identify the relevant factors when many irrelevant attributes are present.

Application of fleuret method to artificial datasets
In our experiments, we used an implementation of the Fleuret Method available on the author's web page http://www.idiap. ch/ , fleuret. This tool is able to perform feature selection with the Fleuret method and train/test a bayesian or a perceptron classifier. We were interested only in checking if the feature selection method was able to find the interacting SNPs of two artificial datasets, one with two interacting loci and the other with a total of 10 disease loci. However, the Fleuret method was not able to detect any of the interacting SNPs on these two datasets.
These bad results may happen because the Fleuret method initializes the score table s with mutual informationÎ I(Y ; X n ). At the first step the algorithm greedily chooses the feature v(k) with the highest mutual information. This first step selects the wrong variable in SNPHarvester datasets, since the disease loci marginals are very low. This has an impact on the following steps since each disease locus does not add new information about the class to the wrongly selected SNP. As a consequence, the algorithm will continue to make wrong decisions.
The first steps are crucial and that is why we propose another approach that is based on testing all the possible pairs and evaluate them with Information Theory measures.

Application of information interaction method to artificial datasets
We applied information interaction metric to all possible pairs of SNPs in one artificial dataset and found that there was a big difference between values of true disease loci pairs and other noisy pairs of SNPs. It was possible to define a threshold that could perfectly distinguish between true disease loci pairs and not associated SNPs. Figure 2 shows the results obtained with Information Interaction method on one artificial dataset. The distinction between true disease loci and other SNP pairs is very explicit making it possible to detect the 5 pairwise interactions. We can see 5 points representing the 5 pairs of SNPs that are causing the disease in this simulated dataset. The minimum information interaction value of the 5 pairs of SNPs is 0:194 while the highest information interaction value of all the other pairs of SNPs is 0:0317. There are however other harder artificial datasets in which it is not so clear to detect all the 5 pairwise interactions as shown in Figure 3.
The results obtained in the artificial dataset, encouraged us to apply to all the datasets that were generated with equal or different simulation parameters. We could then compare the results obtained with the results of SNPHarvester and BEAM methods.
One of the main issues in applying information interaction (II) search is the choice of threshold T. In preliminary experiments we used three variations. The first one was already described and is based on using a fixed value of T to all the datasets. However, in different datasets this value of T might not be the most appropriated. That is why se decided to use a permutation testing strategy to select T. Our method is as follows: For each dataset we run P permutations and select the maximum II value between all the interactions. We multiply that maximum value by a given constant C (we used C~1:2 in our experiments) and that resulting value is the threshold T that we are going to use for that particular dataset. In our preliminary experiments we tested the number of permutations P for threshold selection to be P~100 and P~10. We decided to use P~10 in our experiments.
We performed a comparison of our method with the results reported in [3] both for SNP Harvester and BEAM. In this comparison we used the Information Interaction Method (IIM) with a permutation strategy for threshold selection using C~1:2 and P~10. The results of the comparison are shown on Figure 4.
As we can see, our Information Interaction method (IIM) outperforms both SNP Harvester and BEAM in terms of power. In Hybrid Model 8 our method has more problems in discovering the interactions because the signal is more dispersed in the noise. However, even in those harder conditions, our method performed better than SNP Harvester and BEAM. We also show in Figure 5 the comparison between our method and SNP Harvester in terms of the generation of false positives. As we can see, both methods   generate few false positives. In a total of 800 datasets, SNP Harvester discovers a total of 5 false positives while our method discovers 8.
Information Interaction search showed to be a very powerful method to detect interactions between SNPs with low marginals and is therefore a very valid option to apply to real GWAS data.

Application of regression methods to WTCCC breast cancer dataset
We tried to apply regression on the SPSS tool, but it was hardly possible: due to the high number of variables the software blocked (in a computer with 8 Gigabytes of RAM). We decided then to use statistical tool R in order to perform our experiments with regression and the same problem occurred given the high dimensionality of the parameters space. The methodology that we were able to run successfully was to use Bayesian Information Criterion (BIC) as the model selection criterion. When we learn the model of marginal effects using Lasso Penalized Logistic Regression and BIC, only 38 variables are selected. We then build the data matrix with all the interactions between these 38 selected variables and trained the model with Lasso and BIC. The final model includes 37 (of the 38) marginal variables and 11 interaction variables. This methodology has the disadvantage of eliminating SNPs with small effects that can be involved in interactions with low marginals.

Application of information interaction method to WTCCC breast cancer dataset
After the pre-processing of the WTCCC Breast Cancer dataset, we applied the information interaction method. To determine the threshold T to use, we ran 1000 permutations tests. In each permutation test we applied the information interaction measure to all possible pairs of variables. At the end we selected the highest value h of information interaction between all pairs of SNPs discovered in the 1000 permutation tests. We then used a value of T greater than h so that each interaction discovered is statistically significant at the 0:001 level of confidence. In our permutation procedure h was found to be h~0:008875. We then fixed T~0:009. The distribution of information interaction values, greater than T~0:009, found in the original WTCCC Breast Cancer dataset is shown in Table 2 and Figure 6. Figure 7 shows the network of interactions found with information interaction method in the WTCCC Breast Cancer dataset. Each SNP that was found with IIM is represented in the graph by a node that has the name of the gene to which the SNP belongs to. For each pairwise interaction discovered, we draw an edge between the nodes that correspond to the SNPs involved.
The null hypothesis that is considered in this work is that the phenotype is independent from the SNPs. If this is true, then we can make permutations of the phenotype and the distribution of information interaction values would be similar. However we observe that in our 1000 permutation tests it was not possible to find any II value above 0:009. On the other hand we can see that in our original dataset there are 89 pairs of SNPs that have an II value higher than 0:009. This means that we can use any of these   89 pairs of SNPs to reject the null hypothesis with a confidence of 0:001, which is our p-value.

Application of mutual information method to WTCCC breast cancer dataset
We also applied the mutual information method to the WTCCC Breast Cancer dataset. As already mentioned, we used 1000 permutation tests to determine the threshold T. In each permutation test we applied the mutual information measure to each SNP and the phenotype. At the end we selected the highest value h of mutual information between each SNP and the phenotype in all the 1000 permutation tests. We then used a value of T greater than h so that each association discovered is statistically significant at the 0:001 level of confidence. In our permutation procedure, h was found to be h~0:0078. We then fixed T~0:0079. A total of 90 different SNPs were found to share information with the phenotype with a mutual information value above the fixed threshold T. The distribution of mutual information values greater than T~0:0079, found in the original WTCCC Breast Cancer dataset is shown in Table 3 and Figure 8.

Performance characteristics of information interaction method
The execution of the Information Interaction Method on the Wellcome Trust Breast Cancer dataset takes around 45 minutes in a 2.3 GHz AMD Opteron Processor. The execution of 1000 permutations in a single processor would take about a month, or 3 days if we use 10 processors. We also made some experiments with  an Alzheimer dataset from the Alzheimers Disease Neuroimaging Initiative (ADNI) with 600000 SNPs and the execution after data quality procedures took around four days. We did not complete the execution of the 1000 permutation tests because we could conclude in advance that the results would not be statistically significant. The complete set of 1000 permutation tests would take several years to complete in a single processor, although it could be parallelized. Since most of the time it is not practical to have access to a large number of processors, there is still the need for methods that need less permutation tests in order to measure statistical significance. We are currently working on methods that obtain the estimates of the p-values with a much smaller number of permutation tests.

Discussion
In our work, we adapted the source code from the Fleuret method in order to calculate information interaction over all possible pairs of SNPs. With this approach we benefited from the efficient calculations of conditional mutual information that was already developed. Even though computational efficiency is an important issue, because we want results in feasible time, it was not our major point of interest. We gave priority on designing a method that could give us some guarantees of finding pairwise interactions, in cases where both variables have low marginals, if they exist, even if we have to wait for the results. We saw from the results of the application of our method to the artificial datasets that our method was powerful when compared to other state of the art methods. Therefore, if that type of interactions exist in the real breast cancer dataset, our method would detect them. Another advantage of our information interaction method over stochastic methods is the fact that it is deterministic. One of the problems of stochastic methods is that they can give different results in different runs. Our method makes an exhaustive analysis of pairs of SNPs and gives always the same results.
In the artificial datasets that were used with the information interaction method, all the interactions had low marginals. Each SNP of the pairwise interactions did not have any statistically significant association with the phenotype when considered alone. We showed in our results that our method based on information interaction could find these pairs of SNPs more often than other state of the art methods such as SNPHarvester or BEAM.
The application of the mutual information method to the WTCCC breast cancer dataset found that there were 90 SNPs that individually shared information with the phenotype with a pvalue pv0:001. In addition, as was already reported, the application of the information interaction method to the WTCCC breast cancer dataset found 89 pairs of SNPs that, when considered together, share more information with the phenotype than when considered individually. A total of 49 different SNPs were involved in the 89 interactions. 48 of these 49 SNPs found with information interaction were also found with the mutual  Discovering Epistasis with Information Interaction PLOS ONE | www.plosone.org information method. The only exception was SNP rs660895 from MHC gene that was found with the information interaction method but not with the mutual information method. Table 4 summarizes the SNPs that were found with mutual information and information interaction methods. Even though the scientific literature refers the need for methods to discover low marginal interactions, our results suggest that most epistatic interactions with relevance to breast cancer have moderate or high marginals. In addition none of the interactions discovered with the information interaction method were reported in the STRING network. We believe that this method should be applied to more breast cancer datasets and also to other datasets from other diseases in order to find more about the kind of epistatic interactions that exist in real diseases.