Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

Microarray and beadchip are two of the most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify a special type of rule and potential biomarkers from biological datasets using an integrated approach of statistical and binary inclusion-maximal biclustering techniques. At first, a novel statistical strategy is utilized to eliminate insignificant, low-significant, and redundant genes in such a way that the significance level respects the distribution of the data (viz., either normal or non-normal). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. The corresponding special rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time and can work on big datasets. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using the DAVID database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to determine how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we compare the average classification accuracy and other related factors with those of other rule-based classifiers. Statistical significance tests are also performed to verify the statistical relevance of the comparative results.
Here, each of the other rule mining methods or rule-based classifiers also starts from the same post-discretized data matrix. Finally, we include an integrated analysis of gene expression and methylation for determining the epigenetic effect (viz., the effect of methylation) on gene expression levels.


Introduction
The microarray technique is a useful tool for measuring gene expression data across different experimental and control samples. Similarly, the beadchip is another efficient technique for generating genome-wide DNA methylation profiles on the Infinium II platform. DNA methylation is an important epigenetic factor that refers to the addition of a methyl group (-CH3) to position 5 of the cytosine pyrimidine ring or to the number 6 nitrogen of the adenine purine ring in genomic DNA. It modifies, and in general decreases, the expression levels of genes. Both the expression and methylation data matrices [1], [2], [3], [4] are initially organized in such a way that rows and columns indicate genes and samples (conditions), respectively. Statistical analysis [5], [6], [7] is an important tool to identify differentially expressed/methylated (i.e., DE/DM) genes across different types of samples.
Association rule mining (ARM) [8], [9] is another useful tool for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining), to identify special rules of genes and potential biomarkers from large gene expression and/or methylation data by integrating a novel statistical technique and the binary inclusion-maximal biclustering technique, consecutively.
In traditional association rule mining algorithms, a huge number of rules is produced as the result. Thus, it is difficult to run them on medium or large sized datasets in which the number of genes is approximately 250 or more. To solve this problem, in our proposed method we have utilized the binary inclusion-maximal biclustering (i.e., BiMax) technique [10] for mining non-redundant significant itemsets and the corresponding special rules. However, the biclustering technique can only work on datasets in which the number of genes is at most approximately 10,000; if the number is greater, it fails to work. Thus, for such large datasets, we have to apply a statistical strategy before using the biclustering technique to eliminate redundant, insignificant, and low-significant genes in such a way that the significance level relies on the distribution of the data (viz., either normal or non-normal). Therefore, first of all, the whole dataset is passed through fundamental statistical preprocessing (viz., removal of genes having low variance, and normalization, consecutively). Now, if there is a large number of samples in a dataset, there is no need to use any normality test before applying a statistical test, as all statistical tests perform more or less well for large numbers of samples. But if a dataset has a small number of samples, it has been observed that different statistical tests perform differently [11]. It is well known that the standard t-test, Welch's t-test, the Bayes t-test, and Pearson's correlation test are all parametric statistical tests, while Limma, Significance Analysis of Microarrays (SAM), Wilcoxon's rank-sum test, and the permuted t-test are considered non-parametric tests. Some non-parametric statistical tests (like Limma and SAM) perform well for both normally and non-normally distributed data in all conditions, especially for small sample sizes.
However, the performance of SAM is found to be inconsistent: sometimes it performs well, while at other times it fails to work properly for small sample sizes. The performance of the permuted t-test is satisfactory for non-normal distributions for all sample sizes, but for normal distributions it works poorly, especially for small sample sizes. For normally distributed data, Wilcoxon's rank-sum test performs much worse than the standard t-test for small samples. On the other hand, for small sample sizes the standard t-test performs poorly on non-normally distributed data, whereas its performance is better on normally distributed data. The performance of Welch's t-test is poor for both data distributions for small numbers of samples. To summarize, for a small number of samples it is better to test the data distribution in advance [11]; otherwise, p-values may be misleading due to the assumption of an incorrect distribution. Therefore, we have initially used a well-known normality test (i.e., the Jarque-Bera test [12]) to test whether the distribution of each gene's data is normal or not. Depending on these patterns, the dataset is then partitioned into two sub-datasets, where one sub-dataset has all the normally distributed data and the other contains all the non-normally distributed data. It has been observed that parametric tests perform better, on average, for normally distributed data than for non-normally distributed data, while the performance of non-parametric tests is more satisfactory, on average, for non-normally distributed data than for normally distributed data [11].
Therefore, after testing for normality, we have run multiple parametric statistical tests (viz., the t-test [11], Welch's t-test [11], the modified Bayes t-test by Fox and Dimmic [13], and Pearson's correlation test (Corr) [11]) on the normally distributed data to identify differentially expressed/methylated genes, and taken their intersection in order to be certain that the identified genes are truly differentially expressed/methylated. Similarly, we have applied multiple non-parametric tests (viz., Limma [11], Significance Analysis of Microarrays (SAM) [11], Wilcoxon's rank-sum test (Wcox) [11], and the permuted t-test (Perm) [11]) on the non-normally distributed data, and again taken their intersection. A list is then prepared containing the resulting intersected genes from both the normally distributed and the non-normally distributed sub-datasets. These statistical methods determine a properly significant, non-redundant subset of the differentially expressed/methylated genes from the original large dataset. Thereafter, discretization and post-discretization are applied consecutively on this subset of data to convert it into the corresponding boolean matrix. Now, our next major goal is rule mining. For this purpose, the biclustering technique is directly applied on the post-discretized data matrix to determine maximal homogeneous biclusters of genes as maximal frequent closed homogeneous itemsets (viz., MFCHOIs) at a minimum support value. Here, an MFCHOI is a maximal bicluster whose set of class labels/samples is homogeneous. The rules are then extracted from the MFCHOIs. Each evolved rule is of a special type, i.e., the consequent of the rule consists of its class label only. Therefore, each MFCHOI produces a single special rule.
Our proposed methodology performs better than state-of-the-art rule mining algorithms as it generates maximal frequent closed homogeneous itemsets (viz., MFCHOIs) instead of frequent itemsets. Each of the other rule mining methods also starts from the same post-discretized data matrix. Another advantage is that, as these rules are classification rules, we do not need to calculate any rule-interestingness measure (e.g., confidence) other than support. Therefore, the method saves elapsed time and can work on big data in which the number of genes is high. Pathway and Gene Ontology (GO) analyses are conducted on the genes of the evolved rules using the DAVID database. Furthermore, frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers.
Furthermore, it is also necessary to know how accurately the evolved rules describe the remaining test (unknown) data. For this, we perform cross-validation and classification, consecutively, on the data to compute the average accuracy of the proposed method. Therefore, the aforementioned post-discretized data matrix is divided into training and test sets using 4-fold cross-validation (CV). Thereafter, the biclustering technique is applied on the training part of the dataset to determine MFCHOIs at a minimum support value. The special rules are then extracted from the MFCHOIs; each MFCHOI generates a single rule. We have also estimated 23 rule-interestingness measures [14], [15] for the evolved rules, and added another new measure (viz., the number of satisfiable conditions/samples of the corresponding bicluster for each evolved rule). We have estimated the rank of each rule according to each of the 24 measures individually using fractional ranking [16], [17], [18]. The final ranking of each evolved rule is calculated by average ranking over the resulting fractional rankings of the rule. All the rules are then rearranged from best to worst. We have then assigned weights to the final list of rules in such a way that the topmost rule gets the highest weight, the second-ranked rule gets the second-highest weight, and so on; the weight interval between any two consecutively ranked rules is the same. The classification technique is applied on each test data point using a majority voting technique through a weighted-sum method. A comparative performance study with existing popular rule-based classifiers is conducted based on their average classification accuracy, MCC, and related factors. Our classification method provides better performance than the existing popular rule-based classifiers. Here, each of the other rule-based classifiers also starts from the same post-discretized data matrix.
Statistical significance tests (viz., one-way ANOVA) [19] are also performed to verify the statistical relevance of the comparative results.
As mentioned earlier, DNA methylation is one of the important epigenetic factors that can change (generally decrease) the expression levels of genes; therefore, we have also performed an integrative analysis of the gene expression and methylation datasets of a combined dataset. As gene expression is, in general, inversely related to methylation, inversely correlated genes highlight the epigenetic effect (e.g., methylation) on the expression level. Therefore, we have identified genes having an inverse relationship between their methylation and expression levels.
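The inverse-relationship screen described above can be sketched as follows; the toy matrices, gene names, and the correlation cutoff of −0.5 are illustrative assumptions, not values from this work:

```python
# Flag genes whose methylation and expression levels are inversely correlated.
import numpy as np

def inversely_correlated_genes(expr, meth, genes, cutoff=-0.5):
    """Return genes whose expression/methylation Pearson correlation < cutoff."""
    hits = []
    for g, e_row, m_row in zip(genes, expr, meth):
        r = np.corrcoef(e_row, m_row)[0, 1]
        if r < cutoff:
            hits.append(g)
    return hits

expr = np.array([[8.0, 7.5, 3.0, 2.5],   # expression drops as methylation rises
                 [5.0, 5.1, 4.9, 5.0]])  # unrelated to methylation
meth = np.array([[0.1, 0.2, 0.8, 0.9],
                 [0.3, 0.7, 0.2, 0.8]])

print(inversely_correlated_genes(expr, meth, ["GENE_A", "GENE_B"]))  # → ['GENE_A']
```

In a real analysis, the rows of `expr` and `meth` would be the matched expression and methylation profiles of each gene across the same samples.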
The rest of the article is organized as follows. In Section Materials and Methods, the literature review and our proposed methodology are elaborated. Section Results and Discussion presents the source and a brief description of the real datasets, along with the experimental results and discussion. Finally, Section Conclusion concludes the article.

Literature Review
Association rule mining (ARM) [8], [9] is one of the useful tools for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). It can provide association rules based on frequent itemsets. A rule (R) can be described as A ⇒ C, where A, C ⊆ IM and A ∩ C = ∅. Here, A and C are called the antecedent (i.e., the set of items on the LHS of a rule) and the consequent (i.e., the set of items on the RHS of a rule), respectively. The support of the itemset (IM) is defined as the number of transactions in which all of its items appear together. IM is frequent when its support is greater than a threshold value (i.e., the minimum support). The confidence of the rule is defined as the ratio of the support of IM to the support of A. A frequent closed itemset (FCI) is a condensed form of the frequent itemsets; FCIs are used to avoid redundancy.
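The support and confidence definitions above can be illustrated with a minimal sketch over hypothetical gene "items" per sample (transaction):

```python
# Support and confidence over a tiny set of hypothetical transactions.
transactions = [
    {"g1+", "g2-", "g3+"},
    {"g1+", "g2-"},
    {"g1+", "g3+"},
    {"g2-", "g3+"},
]

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """support(A ∪ C) / support(A) for the rule A ⇒ C."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"g1+", "g2-"}))        # itemset appears in 2 transactions
print(confidence({"g1+"}, {"g2-"}))   # 2 / 3
```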
In past decades, the traditional Apriori algorithm [20] was the most fundamental association rule mining technique. Apriori uses a bottom-up technique in which frequent subsets are extended one item at a time to determine each candidate itemset. Groups of the candidate itemsets are then tested against the data. The method terminates when no further successful extension is found. The result of Apriori is the set of rules which describe the occurrence of items in the dataset. Apriori follows breadth-first search for counting the candidate itemsets: it generates candidate itemsets of length k from the itemsets of length k − 1 and discards infrequent candidate itemsets. The set of candidate itemsets contains all frequent itemsets. After extracting all frequent itemsets, the corresponding set of rules is mined from each frequent itemset. As Apriori generates all frequent itemsets, a huge number of rules is produced from the itemsets. Therefore, Apriori cannot run on medium or large sized data. It can hardly work beyond approximately 100 genes; with more than 100 genes, it either takes a long time or fails to run.
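The level-wise candidate generation described above can be sketched as follows; the toy transactions and minimum support are assumptions for illustration, and this is a didactic simplification of Apriori, not the original implementation:

```python
# Level-wise (breadth-first) frequent-itemset mining in the Apriori style:
# frequent (k-1)-itemsets are joined into k-candidates, infrequent ones pruned.
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
min_support = 2

def apriori(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    freq = {}
    level = [frozenset([i]) for i in items]
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        freq.update(survivors)
        # join step: union pairs of surviving k-itemsets into (k+1)-candidates
        keys = list(survivors)
        level = list({a | b for a, b in combinations(keys, 2)
                      if len(a | b) == len(a) + 1})
    return freq

freq = apriori(transactions, min_support)
print(len(freq))   # 3 singletons + 3 pairs survive; {a, b, c} is pruned
```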
After further investigation, different shortcomings have been identified in the traditional Apriori, such as production of a high number of frequent itemsets, high running time, and the problem of multiple scans of the dataset. Many other ARM techniques have been proposed (e.g., AprioriTid [21], Eclat [22], Tao et al. [23], H-mine [24], etc.) to reduce these shortcomings. But for medium or large sized datasets (i.e., those in which the number of genes is greater than approximately 250), either the methods fail to work or they take a long time (viz., approximately 5 hours or more).
To overcome the above limitation, in this article we have used the BiMax biclustering technique [10] for extracting maximal frequent closed homogeneous itemsets (MFCHOIs) and the corresponding special rules. It is a method for identifying groups of all-1 biclusters from a boolean data matrix under certain conditions. The aim of the biclustering is to discover groups of genes (i.e., all-1 biclusters) having similar behaviour under a subset of conditions (samples). The biclustering technique extracts the maximal frequent closed homogeneous itemsets (viz., MFCHOIs), which are proper subsets of the frequent itemsets (FIs); i.e., MFCHOI ⊂ FI. Thus, our proposed method produces far fewer, significant, non-redundant itemsets than the other rule mining algorithms. However, the biclustering technique can only work on datasets in which the number of genes is at most approximately 10,000; if the number is greater, it cannot work on the dataset. Therefore, for large datasets, we need to utilize a statistical strategy before applying the biclustering technique to eliminate redundant, insignificant, and low-significant genes in such a way that the significance level respects the distribution of the data (viz., either normal or non-normal). Hence, we have proposed a computational rule mining framework, StatBicRM, for producing special rules of genes and potential biomarkers from large gene expression and/or methylation datasets by integrating a novel statistical technique and the biclustering technique, consecutively.

Proposed Method
Our proposed technique, StatBicRM, is basically a computational framework for rule mining in which an integrated approach of statistical and binary inclusion-maximal biclustering techniques is applied to a gene expression or methylation dataset (see Fig. 1). Besides this, we have also performed classification using the proposed method to determine how accurately the evolved rules describe the remaining test (unknown) data (see Fig. 2).
The steps of StatBicRM are described briefly as follows.

Identification of differentially expressed/methylated genes using statistical tests. Our proposed method basically depends on statistical analysis. In a big gene expression/methylation dataset, there may exist 10,000 or more genes. Among them, most genes are non-differentially expressed/methylated (i.e., nDE/nDM), and only some are differentially expressed/methylated (i.e., DE/DM). When a rule is generated, these two types of genes may occur together in it. In the biological scenario, only DE/DM genes make sense in a rule relating to a specific disease, whereas the other type of gene is irrelevant to the disease. Therefore, we have initially used a novel statistical strategy on the dataset to identify the set of statistically significant, non-redundant DE/DM genes in such a way that the significance level relies on the distribution of the data (viz., either normal or non-normal). To do this, the genes which have low variance are first eliminated from the gene expression/methylation dataset. Thereafter, we have used zero-mean normalization on the data of the remaining genes to adjust values measured on different scales to a common scale. The zero-mean normalization can be stated as:

v′_ij = (v_ij − μ_i) / σ_i,    (1)

where μ_i and σ_i denote the mean and standard deviation of the expression/methylation data of gene i before normalization, respectively; and v_ij and v′_ij refer to the value of the i-th gene at the j-th condition before and after normalization, respectively.
It is well known that parametric statistical tests [25] are appropriate for normally distributed data, and non-parametric statistical tests [25] for non-normally distributed data. Therefore, the Jarque-Bera normality test [12], [26] is applied to the normalized data of each gene to determine whether its distribution is normal or non-normal. The Jarque-Bera test statistic is defined as:

JB = (d/6) × (S² + (K − 3)²/4),    (2)

where d denotes the degrees of freedom, S is the skewness of the sample, and K refers to the kurtosis of the sample. Hence, depending on the resulting distribution patterns, the whole normalized dataset is partitioned into two sub-datasets, where one sub-dataset has all the normally distributed data and the other contains all the non-normally distributed data. Thereafter, we have applied four parametric statistical tests (viz., the t-test [11], Welch's t-test [11], the modified Bayes t-test by Fox and Dimmic [13] in 2006, and Pearson's correlation test (Corr) [11]) on the normally distributed sub-dataset to obtain its differentially expressed genes. Similarly, four non-parametric tests (viz., Limma [11], Significance Analysis of Microarrays (SAM) [11], Wilcoxon's rank-sum test (Wcox) [11], and the permuted t-test (Perm) [11]) are applied on the non-normally distributed sub-dataset to obtain its differentially expressed genes.
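The normalization-then-partition step above can be sketched with SciPy's Jarque-Bera implementation; the synthetic two-gene data and the conventional 0.05 significance cutoff are assumptions:

```python
# Zero-mean normalization per gene, then a Jarque-Bera normality test to
# split genes into normally and non-normally distributed groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(5.0, 1.0, 200),    # roughly normal gene
                  rng.exponential(1.0, 200)])   # strongly skewed gene

def normalize_and_partition(data, alpha=0.05):
    mu = data.mean(axis=1, keepdims=True)
    sigma = data.std(axis=1, keepdims=True)
    z = (data - mu) / sigma                     # zero-mean normalization
    normal_idx, nonnormal_idx = [], []
    for i, row in enumerate(z):
        _, p = stats.jarque_bera(row)
        (normal_idx if p > alpha else nonnormal_idx).append(i)
    return z, normal_idx, nonnormal_idx

z, normal_idx, nonnormal_idx = normalize_and_partition(data)
```

Each row of `z` then feeds into the parametric battery if its index landed in `normal_idx`, and into the non-parametric battery otherwise.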
Before proceeding further, we briefly discuss some of the statistical tests mentioned above.
The two-sample t-test compares the means of the two groups relative to the variation in the data. From the test statistic, we compute a p-value, which indicates the probability of observing a t-value as large as or larger than the actually observed t-value given that the null hypothesis is true. By convention, if the p-value of a gene (item) is less than 5%, the gene is called statistically differentially expressed/methylated. Now suppose, for each gene g, group 1 consists of n_1 treated samples with mean x̄_1g and standard deviation s_1g, and group 2 consists of n_2 control samples with mean x̄_2g and standard deviation s_2g.
The t-statistic is

t_g = (x̄_1g − x̄_2g) / se_g,    (3)

where se_g denotes the standard error of the difference of the groups' means:

se_g = sPooled × √(1/n_1 + 1/n_2),    (4)

and sPooled is the pooled estimate of the population standard deviation, i.e.,

sPooled = √(((n_1 − 1) s_1g² + (n_2 − 1) s_2g²) / (n_1 + n_2 − 2)).    (5)

Here, df is the degree of freedom of the test, stated as df = (n_1 + n_2 − 2). This strategy assumes that the variances of the two groups are equal. For Welch's t-test, the variances of the two groups are first checked for equality. If equal, the t-statistic in Equation 3 is used; otherwise the following t-statistic, with unpooled estimates of the population standard deviations, is used:

t_g = (x̄_1g − x̄_2g) / √(s_1g²/n_1 + s_2g²/n_2).    (6)

Pearson's correlation coefficient (commonly denoted ρ) between two variables is defined as the covariance of the two variables divided by the product of their standard deviations, i.e.,

ρ = cov(x, y) / (σ_x σ_y),    (7)

where

cov(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1),    (8)

and the sample sizes of the two groups are n_1 and n_2, respectively (here, n_1 = n_2). This test can predict whether two variables are related or not. The moderated t-statistic in Limma [27] can be written as

t̃_g = b̂_g / (s̃_g √(v_g)),    (9)

where the sample size is n = n_1 + n_2, and b̂_g and s̃_g² denote the contrast estimator and the posterior sample variance for gene g, respectively. The contrast estimator for gene g is distributed as

b̂_g ~ N(β_g, v_g σ_g²),    (10)

where N is the normal distribution, and the posterior sample variance for gene g is estimated as

s̃_g² = (d_0 s_0² + d_g s_g²) / (d_0 + d_g),    (11)

where d_0 and s_0² refer to the prior degrees of freedom and prior variance, respectively, and d_g and s_g² denote the experimental degrees of freedom and the sample variance of gene g, respectively.
SAM adds a small positive constant s_0 (called the "fudge factor") to solve the small-variance problem. The SAM statistic of Tusher et al. (2001) is

d_g = (x̄_1g − x̄_2g) / (se_g + s_0),    (12)

where se_g is the standard error of the difference of the groups' means (see Equation 4) and sPooled is the pooled estimate of the population standard deviation (see Equation 5). Here, df is the degree of freedom of the test, stated as df = (n_1 + n_2 − 2). In the Wilcoxon rank-sum test, the expression values of each gene are ranked in ascending order across the two groups, and the test then checks the equality of the means of the two ranked samples. The z-statistic of the test is

z = (T − μ_T) / σ_T,    (13)

where

T = min(Σ ranks_group1, Σ ranks_group2),    (14)

and

μ_T = n_1 (n_1 + n_2 + 1)/2,  σ_T = √(n_1 n_2 (n_1 + n_2 + 1)/12).    (15)

The permuted t-test is a kind of t-test in which the group labels of the observed data points of each gene (item) are rearranged (permuted).
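A few of the tests above have direct SciPy analogues; the sketch below runs the standard t-test, Welch's t-test, the Wilcoxon rank-sum test, and a hand-rolled permuted t-test on hypothetical two-group data (Limma, SAM, and the Bayes t-test have no SciPy equivalent and are omitted here):

```python
# Per-gene battery of two-sample tests, sketched with SciPy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(6.0, 1.0, 12)   # hypothetical disease samples for one gene
control = rng.normal(4.0, 1.0, 12)   # hypothetical normal samples

t_p = stats.ttest_ind(treated, control, equal_var=True).pvalue    # standard t-test
welch_p = stats.ttest_ind(treated, control, equal_var=False).pvalue
wcox_p = stats.ranksums(treated, control).pvalue                  # rank-sum test

def perm_pvalue(x, y, n_perm=2000, seed=0):
    """Permuted t-test idea: rearrange group labels, compare mean differences."""
    prng = np.random.default_rng(seed)
    obs = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_perm):
        prng.shuffle(pooled)
        if abs(pooled[:len(x)].mean() - pooled[len(x):].mean()) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)

perm_p = perm_pvalue(treated, control)

# a gene survives only if every applicable test agrees (intersection step)
is_de = all(p < 0.05 for p in (t_p, welch_p, wcox_p, perm_p))
```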
As stated earlier, the four parametric statistical tests are applied on the normally distributed sub-dataset, so different numbers of up-regulated and down-regulated genes emerge from the different parametric tests. Thereafter, we have intersected the up-regulated gene sets to identify the set of common up-regulated genes (denoted UPDESET_N) for the normally distributed sub-dataset. Similarly, we have obtained the set of common down-regulated genes (denoted DOWNDESET_N). We have then made a list (denoted TOTALDESET_N) containing all the common up-regulated and down-regulated genes; i.e., TOTALDESET_N = UPDESET_N + DOWNDESET_N. Similarly, the four non-parametric statistical tests are applied on the non-normally distributed sub-dataset, and different numbers of up-regulated and down-regulated genes emerge from the different non-parametric tests. We have then intersected the up-regulated gene sets to identify the set of common up-regulated genes (denoted UPDESET_NN) for the non-normally distributed sub-dataset, and similarly obtained the set of common down-regulated genes (denoted DOWNDESET_NN). We have then made another list (denoted TOTALDESET_NN) containing all the common up-regulated and down-regulated genes; i.e., TOTALDESET_NN = UPDESET_NN + DOWNDESET_NN.
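The intersection step above can be sketched with hypothetical per-test result sets:

```python
# A gene is kept only if every test in the battery flags it in the same direction.
up_by_test = [            # up-regulated genes reported by four tests (toy data)
    {"g1", "g2", "g5"},
    {"g1", "g2"},
    {"g1", "g2", "g7"},
    {"g2", "g1", "g9"},
]
down_by_test = [
    {"g3", "g4"},
    {"g3"},
    {"g3", "g4"},
    {"g3", "g8"},
]

UPDESET = set.intersection(*up_by_test)      # common up-regulated genes
DOWNDESET = set.intersection(*down_by_test)  # common down-regulated genes
TOTALDESET = UPDESET | DOWNDESET             # final list for this sub-dataset

print(sorted(TOTALDESET))   # → ['g1', 'g2', 'g3']
```

The same computation is done once for the normally distributed sub-dataset and once for the non-normally distributed one, and the two resulting lists are concatenated.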
Finally, we have produced a final list (denoted TOTALDESET_(N+NN)) containing all the common up-regulated and down-regulated genes from the normally distributed and non-normally distributed sub-datasets; i.e., TOTALDESET_(N+NN) = TOTALDESET_N + TOTALDESET_NN. This final list of genes is used in the next step.

Discretization and post-discretization. Suppose the data matrix of the resulting list of genes, TOTALDESET_(N+NN), is denoted by I. First of all, I, whose rows denote genes and columns denote samples, is transposed; let PIT be the transposed matrix. As the PIT matrix is already normalized by zero-mean normalization, the following rule is applied for binary discretization of the matrix:

IT_ij = 1, if PIT_ij > 0; IT_ij = 0, otherwise,

where PIT_ij denotes the expression/methylation value of the i-th row and j-th column (1 ≤ i ≤ m, 1 ≤ j ≤ n), m and n are the numbers of rows (samples) and columns (genes) of the PIT matrix, respectively, and IT is the resulting discretized matrix. Now, let an up-regulated gene and a down-regulated gene in the discretized boolean matrix be denoted by DE_up and DE_down, respectively. In the matrix IT, '1' and '0' refer to the presence of an up-regulated gene (DE_up) and the presence of a down-regulated gene (DE_down), respectively (see part (b) of Fig. 3). After discretization, we apply BiMax biclustering to find all-1 biclusters. As BiMax recognizes only '1', not '0', we need post-discretization such that '1' represents both the DE_up and DE_down properties. Therefore, after discretization, the number of columns is doubled, where the first half is a domain for the DE_up property and the remaining half is another domain for the DE_down property (see part (c) of Fig. 3). E.g., the column denoted g1 in part (b) of Fig. 3 is divided into the two columns denoted g1+ and g1− in part (c) of Fig. 3, where for the g1+ column '1' denotes the presence of the up-regulated gene (DE_up) and '0' denotes its absence, and for the g1− column '1' denotes the presence of the down-regulated gene (DE_down) and '0' denotes its absence. For methylation data, DM_hyper and DM_hypo are used instead of DE_up and DE_down, respectively. Note that in this paper, '+' and '−' denote up-regulation/hyper-methylation and down-regulation/hypo-methylation, respectively.
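The discretization and column-doubling (post-discretization) steps can be sketched as follows, on a hypothetical zero-mean-normalized sample × gene matrix:

```python
# Binarize a zero-mean matrix by sign, then split every gene column g into
# g+ (up-regulated) and g- (down-regulated) columns so '1' encodes both cases.
import numpy as np

PIT = np.array([[ 1.2, -0.7,  0.4],    # samples x genes, zero-mean scale
                [-0.9,  0.8, -1.1]])

IT = (PIT > 0).astype(int)             # 1 = up-regulated, 0 = down-regulated

# post-discretization: first half of the columns is g+, second half is g-
ITb = np.hstack([IT, 1 - IT])

print(ITb)
```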
Dividing the whole dataset into training and test sets. Let the post-discretized matrix be denoted by ITb. For classification of the matrix ITb, we have applied 4-fold cross-validation (CV) on the matrix to divide it into test and training data, where one fold of ITb is used as the test set and the remaining three folds are used as the training set. This procedure is repeated four times, as it is 4-fold CV.
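The 4-fold split above can be sketched with scikit-learn's KFold over the sample rows of a hypothetical post-discretized matrix:

```python
# 4-fold cross-validation split over the rows (samples) of ITb.
import numpy as np
from sklearn.model_selection import KFold

ITb = np.arange(8 * 6).reshape(8, 6)   # hypothetical 8 samples x 6 columns

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(ITb), start=1):
    train, test = ITb[train_idx], ITb[test_idx]
    print(fold, len(train), len(test))  # 6 training rows, 2 test rows per fold
```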
Finding maximal biclusters and extracting special rules. We have transposed the training boolean dataset and applied BiMax biclustering to identify maximal frequent closed homogeneous itemsets (MFCHOIs). Before proceeding further, the fundamental method of BiMax biclustering is discussed briefly.
Suppose a boolean matrix e has size n × m, where n is the number of genes and m is the number of samples. A cell e_ij is 1 if gene i is differentially expressed in sample/condition j, and 0 otherwise. A bicluster (G, S) consists of a subset of genes G ⊆ {1, 2, . . ., n} which are differentially expressed together under a subset of samples S ⊆ {1, 2, . . ., m}; i.e., the pair (G, S) refers to a submatrix of e all of whose elements are 1. Only the biclusters which are inclusion-maximal (i.e., the biclusters that are not entirely contained in any other bicluster) are of interest. The pair (G, S) ∈ 2^{1,2,. . .,n} × 2^{1,2,. . .,m} is an inclusion-maximal bicluster [10] if and only if (i) e_ij = 1 for all i ∈ G, j ∈ S, and (ii) there is no other pair (G′, S′) ∈ 2^{1,2,. . .,n} × 2^{1,2,. . .,m} with G ⊆ G′, S ⊆ S′, (G, S) ≠ (G′, S′), and e_ij = 1 for all i ∈ G′, j ∈ S′. When no proper superset of an itemset is found at the same support value, the itemset is called a closed itemset. Finding the set of frequent itemsets is exactly equivalent to finding the set of all-1 biclusters, each having at least a minimum number of conditions/samples (i.e., satisfying the minimum support).
In our experiment, for the biclustering, we have set a fixed minimum cutoff on the number of items/genes (viz., 2), and different minimum cutoffs on the number of samples/conditions for determining itemsets at different minimum supports of each rule. BiMax biclustering can generate all maximal biclusters, and the items/genes of a maximal bicluster represent a (maximal) closed itemset. Thus, all extracted biclusters that satisfy the minimum support condition produce the set of (maximal) frequent closed itemsets with their class labels (i.e., the conditions/types of samples). Thereby, we have to filter the biclusters depending on their conditions: we select those (maximal) biclusters which have a group of homogeneous (non-contradictory) conditions. In other words, we identify the maximal frequent closed homogeneous itemsets (MFCHOIs). E.g., Bicluster 1 is an MFCHOI which has three genes, g1+, g2− and g3+, and two homogeneous conditions/samples, s_tr1 and s_tr3 (presented in part (d) of Fig. 3). Similarly, Bicluster 2 is another MFCHOI which has three genes, g1+, g2+ and g3−, and two homogeneous conditions/samples, s_nr1 and s_nr3 (presented in part (e) of Fig. 3). Hence, we have omitted the biclusters that have a group of heterogeneous (contradictory) conditions. E.g., Bicluster 3 is such a heterogeneous (contradictory) bicluster, with three genes, g1−, g2+ and g3+, and two heterogeneous conditions, s_tr2 and s_nr2 (presented in part (f) of Fig. 3).
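As a rough illustration of the filtering above, the brute-force sketch below enumerates all-1 blocks in a tiny boolean matrix, keeps the inclusion-maximal ones meeting the support cutoff, and retains only those whose samples share one class label (the MFCHOIs). Real BiMax uses a divide-and-conquer strategy rather than enumeration, and the matrix, labels, and cutoffs here are toy assumptions:

```python
# Brute-force stand-in for BiMax plus the homogeneity filter.
from itertools import combinations

matrix = [                      # samples x genes (g1+, g2-, g3+)
    [1, 1, 1],                  # s_tr1, label "disease"
    [1, 1, 1],                  # s_tr3, label "disease"
    [1, 0, 0],                  # s_nr1, label "normal"
]
labels = ["disease", "disease", "normal"]
min_items, min_support = 2, 2

def homogeneous_biclusters(matrix, labels):
    found = []
    n_genes = len(matrix[0])
    for k in range(min_items, n_genes + 1):
        for cols in combinations(range(n_genes), k):
            rows = [i for i, r in enumerate(matrix) if all(r[c] for c in cols)]
            if len(rows) >= min_support and len({labels[i] for i in rows}) == 1:
                found.append((tuple(rows), cols))
    # keep only inclusion-maximal (rows, cols) blocks
    return [b for b in found
            if not any(b != o and set(b[0]) <= set(o[0])
                       and set(b[1]) <= set(o[1]) for o in found)]

print(homogeneous_biclusters(matrix, labels))
```

Each surviving (rows, columns) pair corresponds to one MFCHOI, and hence to one special rule whose consequent is the shared class label of its rows.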
Ranking of rules. We have evaluated each evolved rule based on 24 rule-interestingness measures. Support, confidence, coverage, prevalence, sensitivity (or recall), specificity, accuracy, lift (or interest), leverage, added value, relative risk, Jaccard, Yule's Q, Klosgen, Laplace correction, Gini index, two-way support, linear correlation coefficient (or ϕ-coefficient), cosine, least contradiction, Zhang, leverage2 (or Piatetsky-Shapiro) and kappa [14], [15] are included among the 24 measures (see S1 Text). The last and novel measure is the number of satisfiable conditions/samples of each evolved rule. E.g., according to part (g).(i) of Fig. 3, the value of this measure for rule id 1 (i.e., {g1+, g2−, g3+} ⇒ disease) is 2, as its corresponding bicluster (in part (d) of Fig. 3) has two conditions (s_tr1 and s_tr3). Similarly, according to part (g).(ii) of Fig. 3, the value of the measure for rule id 2 (i.e., {g1+, g2+, g3−} ⇒ normal) is 2. The rank of a rule is proportional to the value of each measure (i.e., if a rule has a higher value of the measure than another rule, then its rank is better than the second rule's).
Thereafter, the evolved rules are ranked according to each of the 24 rule-interestingness measures individually using fractional ranking [16][17][18]. In fractional ranking, items that compare equal receive the same rank, namely the mean of the ordinal ranking numbers they would otherwise receive. E.g., suppose a data set is {1 2 2}. Only two distinct numbers are present, so there are two distinct ranks. If the two 2s were actually different numbers, they would hold ranks 2 and 3, respectively. As these two numbers are equal, we compute their rank as the average of those ranks: (2+3)/2 = 2.5; therefore, the fractional ranks are: 1 2.5 2.5.
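The fractional-ranking rule above can be expressed as a short function. A sketch (ascending order, matching the {1 2 2} example; for the interestingness measures, where higher values are better, one would rank in descending order of the measure):

```python
# Fractional ranking: tied items receive the mean of the ordinal ranks
# they would otherwise occupy.

def fractional_ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2.0   # mean of ordinal ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

print(fractional_ranks([1, 2, 2]))  # [1.0, 2.5, 2.5]
```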
Hence, the final ranking of each rule is determined by averaging the rule's fractional rankings over all 24 measures. The rules are then arranged in ascending order of this final rank (i.e., from best to worst).
Assigning weights to the rules w.r.t. their final ranking. For classification, we apply a majority voting technique to each test data point to identify its class-label through a weighted-sum method. Thus, we first assign weights to the final list of rules such that the topmost rule gets the highest weight, the second-ranked rule gets the second-highest weight, and so on. The weight of the first-ranked rule is always 1, and the weights lie between 0 and 1.
The weight of each rule (denoted by w_j, 1 ≤ j ≤ p) is estimated as a function of the final rank of the rule (denoted by r_j) and the total number of rules (viz., p), as described in Equation 1. Here, the weight-interval between any two consecutive ranked rules is the same, and the calculated weights of the rules are normalized using zero-mean normalization (in Equation 1). Majority voting and classification. Consider one test data point (ts). To determine its predicted class-label, we have applied the majority voting technique. At first, we identify the rules all of whose items (genes) in the antecedent side exist in ts. The weights of the rules (i.e., trR_ts number of rules) that have only the class-label 'disease' in their consequent sides are then summed up (viz., Ws_tr_ts). A similar summation (viz., Ws_nr_ts) is performed for the rules (i.e., nrR_ts number of rules) having only the class-label 'normal' in their consequent sides. The two weighted sums are then compared, and the class-label with the higher weighted sum becomes the predicted class-label of ts (viz., PredCls_ts). If the two weighted sums are equal, the class-label of the top-ranked rule that satisfies ts (i.e., ClsTopR_ts) becomes its predicted class-label. If no rule satisfies ts at all, the class-label of the rule that satisfies the maximum number of test points (i.e., Cls_R0) becomes its predicted class-label. An example of the majority voting technique is presented in Fig. 4. This step is repeated for the other test points, and the whole process is then repeated 4 times for the 4 sub-matrices of the test data, as 4-fold CV is used here. Using this technique, we have calculated true positive (TP), true negative (TN), false positive (FP), false negative (FN), sensitivity [7], specificity [7], accuracy [7] and Matthews correlation coefficient (MCC) [7] for the proposed classification.
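The weighting and weighted majority vote can be sketched as follows. Since Equation 1 is not reproduced here, the linear weight w_j = (p − r_j + 1)/p below is an assumed formula that merely satisfies the stated properties (the top rank gets weight 1, equal spacing between consecutive ranks, weights in (0, 1]); the rule representation is likewise an assumption for illustration.

```python
# Hedged sketch of the weighted majority vote; not the authors' code.

def rule_weight(rank, p):
    """Assumed linear scheme: rank 1 -> 1.0, equal steps of 1/p."""
    return (p - rank + 1) / p

def predict(test_point, rules, p):
    """rules: list of (rank, antecedent_set, class_label), sorted by rank.
    The paper's final fallback (the rule satisfying the most test points,
    Cls_R0) is omitted here for brevity."""
    sums = {"disease": 0.0, "normal": 0.0}
    top_label = None
    for rank, antecedent, label in rules:
        if antecedent <= test_point:              # all rule items occur in ts
            sums[label] += rule_weight(rank, p)
            if top_label is None:                 # best-ranked satisfying rule
                top_label = label
    if sums["disease"] != sums["normal"]:
        return max(sums, key=sums.get)
    return top_label                              # tie broken by the top rule

rules = [(1, {"g1+", "g2-"}, "disease"), (2, {"g1+", "g3-"}, "normal")]
print(predict({"g1+", "g2-", "g3-"}, rules, p=2))
```

For the example test point, both rules fire; the first-ranked rule carries the larger weight, so the 'disease' label wins the vote.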
Sensitivity, specificity, accuracy and MCC are defined as Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP), Accuracy = (TP+TN)/(TP+TN+FP+FN), and MCC = (TP×TN − FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), respectively. We have repeated the 4-fold CV 10 times, and their average sensitivity, average specificity, average accuracy and average MCC are then calculated, with standard deviations, based on the results of the cross-validations. Thereafter, a comparative performance analysis has been conducted between our proposed method (i.e., Prop) and other popular rule-based classifiers (i.e., ConjunctiveRule (CJR) [28], DecisionTable (DT) [28], JRip [28], OneR [28], PART [28] and Ridor [28], implemented in the Weka 3.6 software) based on their average sensitivity, average specificity, average accuracy and average MCC. Note that the other rule-based classifiers also start with the same post-discretized matrix (i.e., ITb). We have also performed a significance test (viz., one-way ANOVA) on the accuracies of the classifiers pairwise, to obtain the level of significance (i.e., p-value) of each (pairwise) comparison. The above steps describe the proposed methodology for classification.
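The four performance measures above, computed from the confusion-matrix counts, can be written compactly (a sketch using the standard definitions, not the authors' code):

```python
import math

def metrics(tp, tn, fp, fn):
    """Standard sensitivity, specificity, accuracy and MCC from TP/TN/FP/FN."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, acc, mcc

print(metrics(8, 7, 1, 2))
```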
Performance comparison with other rule mining algorithms. For the purpose of rule mining only, the whole post-discretized data matrix (i.e., ITb) is used directly as the input of the BiMax biclustering. In this case, we have not performed any cross-validation, since no classification is needed. We have compared our proposed rule mining algorithm with other existing popular rule mining algorithms (i.e., AprioriTid [21], Eclat [22], Tao et al. [23] and H-mine [24]). It should be noted that the same input binary matrix (i.e., ITb) is utilized for the other rule mining algorithms.
Biological significance of evolved rules, and biomarker identification. As all the real datasets are microarray/beadchip (biological) datasets, the evolved top rules should have biological significance. Information about the relation between the genes and any disease can be determined from pathway and Gene Ontology (GO) analyses. If all the genes (excluding the class-label) of a rule occur together in a pathway/GO-term, and if the occurrence is statistically significant (i.e., p-value less than 0.05), then the rule is a (statistically) biologically significant rule. If the pathway/GO-term relates to the corresponding disease, then the rule becomes important for diagnosing the disease. Therefore, KEGG pathway and GO analyses have been performed on the genes of the evolved rules using the DAVID database to identify the top significant rules with their involved KEGG pathways or GO-terms. The top rules involved in a significant number of pathways/GO-terms are thereby obtained. A frequency analysis of the genes occurring in the evolved rules of the experimental/treated class is performed to identify potential biomarkers.

Results and Discussion
In this section, we first describe the real datasets that are utilized to verify the performance of our proposed method (i.e., StatBicRM). Thereafter, we report experiments on the real datasets as well as on some artificial datasets; the artificial datasets are generated by taking random Boolean values. Some related discussions are also included at the end of this section.

Real Datasets
We have used three real datasets. The datasets are described in Table 1.

Experimental Results and Discussion
As stated in section Materials and Methods, we have initially applied the statistical filtering strategy (viz., removal of genes having low variance, normalization, normality test, and different parametric/non-parametric tests, consecutively) to the three real datasets. For the combined dataset (GSE31699), we have first identified 13072 common genes that have both expression data (in Dataset 2) and methylation data (in Dataset 3). Thereafter, we have found that 18228, 10236 and 8176 genes follow the normal distribution, and 24222, 2836 and 4896 genes do not, for the three datasets, respectively. For DS1, we have identified a total of 1160 differentially expressed genes (viz., |TOTALDESET_(N+NN)| = 1160), including |UPDESET_N| = 344, |UPDESET_NN| = 93, |DOWNDESET_N| = 403 and |DOWNDESET_NN| = 320, at the 0.0001 p-value cutoff (see Table 2). Similarly, for DS2, we have determined a total of 292 differentially expressed genes (viz., |TOTALDESET_(N+NN)| = 292), including |UPDESET_N| = 54, |UPDESET_NN| = 86, |DOWNDESET_N| = 82 and |DOWNDESET_NN| = 70, at the 0.05 p-value cutoff (see Table 3). For DS3, we have identified a total of 536 differentially methylated genes (viz., |TOTALDMSET_(N+NN)| = 536), including |HYPERDMSET_N| = 74, |HYPERDMSET_NN| = 181, |HYPODMSET_N| = 118 and |HYPODMSET_NN| = 163, at the 0.05 p-value cutoff (see Table 4). These genes (i.e., TOTALDESET_(N+NN) or TOTALDMSET_(N+NN)) are used in the next step. A clustergram of the differentially expressed genes for Dataset 1 is shown in Fig. 5. A volcano plot, used for identifying up-regulated as well as down-regulated genes via SAM for Dataset 1, is presented in Fig. 6.
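The per-gene routing between parametric and non-parametric testing can be sketched as below. This is an illustrative SciPy fragment, not the authors' code: the paper applies four parametric and four non-parametric tests per branch, whereas this sketch uses one of each (a t-test and a Wilcoxon rank-sum test) for brevity.

```python
from scipy import stats

def gene_pvalue(expr_case, expr_ctrl, alpha=0.05):
    """Route one gene to a parametric or non-parametric two-sample test
    depending on a Shapiro-Wilk normality check of both groups."""
    _, p_norm_case = stats.shapiro(expr_case)
    _, p_norm_ctrl = stats.shapiro(expr_ctrl)
    if p_norm_case > alpha and p_norm_ctrl > alpha:     # normally distributed
        _, p = stats.ttest_ind(expr_case, expr_ctrl)    # parametric test
    else:                                               # non-normal
        _, p = stats.ranksums(expr_case, expr_ctrl)     # non-parametric test
    return p

p = gene_pvalue([5.1, 5.3, 4.9, 5.2, 5.0], [7.8, 8.1, 7.9, 8.3, 8.0])
print(p < 0.05)
```

For this clearly separated toy gene, either branch yields a significant p-value, so the gene would be retained as differentially expressed.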
Furthermore, we have prepared an artificial large microarray dataset (denoted by "Dataset 4" or "DS4") having a large number of experimental samples as well as a large number of control samples (viz., 100 experimental samples and 100 control samples) and 30000 genes. We have included this dataset in our experiment for validating the performance of the classifiers on a large number of samples, applying our proposed method to DS4 just as to DS1, DS2 and DS3. For DS4, we have obtained that 5348 genes follow the normal distribution, and the remaining 24462 genes do not. Thereafter, we have determined a total of 942 differentially expressed genes (viz., |TOTALDESET_(N+NN)| = 942), including |UPDESET_N| = 51, |UPDESET_NN| = 532, |DOWNDESET_N| = 23 and |DOWNDESET_NN| = 336, at the 0.05 p-value cutoff (see Table 5).
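Generating an artificial dataset of random Boolean values, as done for DS4 (and later for ArDS5/ArDS6), can be sketched as follows; the function name, column ordering and sizes shown are illustrative assumptions, not the authors' generator.

```python
import random

def make_artificial_dataset(n_genes, n_exp, n_ctl, seed=0):
    """Random 0/1 matrix (genes x samples); first n_exp columns are
    experimental samples, the remaining n_ctl are control samples."""
    rng = random.Random(seed)
    matrix = [[rng.randint(0, 1) for _ in range(n_exp + n_ctl)]
              for _ in range(n_genes)]
    labels = ["experimental"] * n_exp + ["control"] * n_ctl
    return matrix, labels

# ArDS5-like toy sizes: 100 genes, 30 experimental + 30 control samples
matrix, labels = make_artificial_dataset(n_genes=100, n_exp=30, n_ctl=30)
print(len(matrix), len(matrix[0]), labels.count("experimental"))
```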
As stated in section Materials and Methods, after the statistical testing analysis, we have applied discretization, post-discretization, division of the data into training and test sets using 4-fold CVs, finding of MFCHOIs (i.e., maximal biclusters), and extraction of the special rules, consecutively. E.g., a bicluster of a few DE up genes (viz., DISP1, DNALI1, RGN, RPL13P5, FBXO2 etc.) and DE down genes (viz., KIF11, CENPN, CENPW, DTL, UHRF1, CDCA4 etc.) across a set of two homogeneous conditions (i.e., AC-sample11/AC11 and AC-sample13/AC13) for DS1 is presented graphically in Fig. 7. After extracting the rules, their ranking, weight assignment, majority voting and two-class classification are performed, consecutively (see section Materials and Methods). In our experiment, we have run the 4-fold CVs 10 times for each dataset. Thereafter, we have calculated the average sensitivity, average specificity, average accuracy and average MCC, and finally compared these with those of the existing rule-based classifiers (presented in Table 6 for DS1, Table 7 for DS2, Table 8 for DS3, and Table 9 for DS4). For each of the four datasets, our proposed method provides better average accuracy and average MCC than the other six classifiers. It provides 95%, 75.94%, 88.61% and 83.77% accuracies, and 0.88, 0.58, 0.77 and 0.64 MCCs for the four datasets, respectively. For DS1, it provides the best average sensitivity (viz., 99.25%) and average specificity (viz., 85.55%). For DS2, it shows the best average sensitivity (i.e., 98.13%), but its average specificity (i.e., 53.75%) is less than those of the other classifiers. Similarly, for DS3, it produces the best average sensitivity (i.e., 90.56%), but its average specificity (i.e., 86.67%) is less than those of four classifiers (viz., PART, JRip, ConjunctiveRule and OneR). For DS4, it produces the best average sensitivity (i.e., 84.13%), but its average specificity (i.e., 83.62%) is less than that of the PART classifier.
Fig. 8 shows the comparison of the dataset-wise average accuracies and average MCCs among our proposed method and the other rule-based classifiers for the four datasets. Thereafter, a statistical significance test (viz., one-way ANOVA) is performed between our method and each of the other classifiers pairwise. We have obtained significant p-values for all the pairwise comparisons (i.e., statistical tests) for DS1 and DS2, whereas five of the six comparisons are statistically significant for DS3 as well as for DS4 (see Fig. 9). The significance level (i.e., p-value) of each comparison is presented in Table 10. For DS1, we have obtained a set of rules, some of which are of cancer subtype AC and the remainder of cancer subtype SCC (i.e., these rules have 'class = AC'/'class = SCC' in their consequent parts). We have produced a list of the top frequent genes occurring in the evolved rules of each cancer subtype. We have identified 'CENPA-' as the top frequent gene for cancer subtype AC. As 'CENPA-' is not found in any evolved rule of cancer subtype SCC, it is important for cancer subtype AC. Some literature evidence for it has been found in [29,30]. (Its p-values, which are significant in Limma, 7.46E-07 in the permuted t-test and 8.42E-06 in the Wilcoxon rank-sum test, indicate that it is a differentially expressed gene.) Besides that, we have also found a literature match for TTK in [31]. The CENPN, KIF2C and EZH2 genes are also found in the literature [32,33] related to AC. Similarly, 'SHROOM3-' is the top frequent gene for SCC. In DS2, we have identified a set of rules, some of which are of the UL class (i.e., tumor class) and the remainder of the MM class (i.e., normal class). We have found some literature documentation about PRL in [34], TRPC6 in [35], and IGF2 in [36].
Similarly, for DS3, we have identified a set of rules, some of which are of the UL class (i.e., tumor class) and the remainder of the MM class (i.e., normal class). The top 10 frequent genes of the two classes for the datasets are shown in Table 11. Some evidence of PLP1 in forming the tumor is found in [37], and similar documentary support of TRPM2 in forming the tumor is found in [38]. Fig. 10 depicts two examples of how significant biomarkers are identified for each class-label of each real dataset. (* One standard deviation of specificity comes out to be zero; on investigation, we have identified a particular data point belonging to the normal class in Dataset 3 for which the PART classifier, as well as the other classifiers including the proposed one, always produces a false positive result.)
(Table 9 gives the comparative performance analysis of the rule-based classifiers on Dataset 4, at 4-fold CVs repeated 10 times, where bold font denotes the highest value in each column.) Besides that, KEGG pathway and GO analyses have been performed on the genes of the evolved rules for the three real datasets using the DAVID database. The significant KEGG pathways, GO:BP, GO:CC and GO:MF terms are presented in Table 12. Some top important rules w.r.t. their associated KEGG pathways, GO:BPs, GO:CCs and GO:MFs, individually, are shown in Table 13, Table 14 and Table 15 for the three real datasets (i.e., DS1, DS2 and DS3), respectively. Here, an important rule w.r.t. its associated pathways/GO-terms refers to a rule all of whose genes (i.e., all genes of the antecedent part) are involved together in the maximum number of pathways/GO-terms. The details of the significant KEGG pathways, GO:BPs, GO:CCs and GO:MFs are presented in S1 File, S2 File and S3 File for the three datasets, respectively. S2 Text shows the top 15 rules from the three real datasets.
Furthermore, we have compared our proposed rule mining algorithm with the existing ARM algorithms (viz., traditional Apriori, AprioriTid, Eclat, Tao et al. and H-mine). Since the number of genes in the datasets is large (i.e., greater than 250) and the other techniques generate frequent itemsets, those methods fail to work on DS1 and DS4, and take a long time to execute on DS2 and DS3 (i.e., nearly 5 hours or more), whereas our method works efficiently on them, as it extracts maximal frequent closed homogeneous itemsets (MFCHOIs) using the biclustering technique. We have therefore prepared two artificial binary datasets by taking random binary digits (i.e., 1 or 0). One of them (denoted by ArDS5) has 100 genes and 60 samples (i.e., a total of 30 experimental samples and 30 control samples), and the other (denoted by ArDS6) includes 200 genes and 60 samples (again, 30 experimental samples and 30 control samples). Thereafter, we have used each binary matrix as the input of the BiMax biclustering. For each of ArDS5 and ArDS6, we have observed that our proposed method produces a much smaller number of significant non-redundant itemsets than the other rule mining algorithms, as it determines the maximal frequent closed homogeneous itemsets (MFCHOIs), which are a proper subset of the frequent itemsets (FIs); i.e., MFCHOIs ⊂ FIs. Furthermore, as a single rule is extracted from a single significant itemset (i.e., MFCHOI) in our method, the number of evolved rules equals the number of significant itemsets (i.e., MFCHOIs). Therefore, the evolved rules in our method are much fewer in number than in the other methods. Thus, the elapsed time of our method is much less than that of the other methods, and it can work on big/medium data (i.e., data having more than roughly 250 genes). Fig. 11 presents the comparison of the number of significant non-redundant itemsets between our proposed method (i.e., StatBicRM) and the other methods in detail. S3 Text shows the comparative study of the number of evolved rules as well as the total elapsed time between our proposed rule mining method and the other rule mining methods for the two artificial binary datasets (i.e., ArDS5 and ArDS6). (Table 11 lists the top 10 frequent genes in the evolved rules of the two class-labels for DS1, DS2 and DS3, respectively; Rule_experimental and Rule_control denote the set of evolved rules of the experimental class-label and of the control class-label, respectively.)
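The subset relation between closed and frequent itemsets, which underlies the efficiency argument above, can be illustrated with a brute-force enumeration over a tiny toy transaction set (the real method uses BiMax, not this enumeration, and this sketch omits the homogeneity constraint):

```python
from itertools import combinations

# Toy transactions over three items; min support = 2 occurrences.
transactions = [{"g1", "g2", "g3"}, {"g1", "g2"}, {"g1", "g2", "g3"}, {"g2", "g3"}]
items = sorted(set().union(*transactions))
min_sup = 2

def support(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# All frequent itemsets, by exhaustive enumeration.
frequent = [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r) if support(set(c)) >= min_sup]
# Closed itemsets: no proper superset has the same support.
closed = [s for s in frequent
          if not any(s < t and support(t) == support(s) for t in frequent)]
print(len(frequent), len(closed))  # closed itemsets are a subset of frequent ones
```

Even on this toy example the closed itemsets are strictly fewer than the frequent ones; on matrices with hundreds of genes the gap grows sharply, which is why mining only (maximal) closed itemsets keeps the rule set and the elapsed time small.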

For the matched expression dataset, we have identified that 10236 of the matched genes are normally distributed, and the rest of them (i.e., 2836 genes) are not. For the matched methylation dataset, we have found 8173 genes that follow the normal distribution, while the remaining 4899 genes do not. Then, the four parametric tests are applied to the normally distributed genes, and the four non-parametric tests to the non-normally distributed genes, both at the 0.05 p-value threshold, for the matched expression dataset as well as the matched methylation dataset. We have found 54 common up-regulated genes and 82 common down-regulated genes from the parametric tests, and 86 common up-regulated genes and 70 common down-regulated genes from the non-parametric tests, for the expression dataset (viz., |UPDESET_N| = 54, |UPDESET_NN| = 86, |DOWNDESET_N| = 82 and |DOWNDESET_NN| = 70). Thereafter, we merge the common up-regulated and common down-regulated genes collected from the results of the parametric and non-parametric tests for the matched expression dataset (viz., |TOTALDESET_(N+NN)| = 292). Similar statistical analyses are performed for the matched methylation dataset (viz., |HYPERDMSET_N| = 89, |HYPERDMSET_NN| = 174, |HYPODMSET_N| = 95, |HYPODMSET_NN| = 165, and |TOTALDMSET_(N+NN)| = 523). Thereafter, our proposed method is applied to the TOTALDESET_(N+NN) genes as well as the TOTALDMSET_(N+NN) genes, individually (see S4 Text). Furthermore, we have concentrated on the internal relationship between gene expression and methylation. As gene expression is inversely related to methylation, inversely correlated genes make sense for highlighting the effect of methylation (i.e., the epigenetic effect) on the expression level. For the combined dataset, six common genes have been identified which are up-regulated as well as hypo-methylated.
These genes are SEMA7A, FSD1, TUBB3, GLIS1, TDO2 and SHOX2. Subsequently, nine common genes are also detected that are both down-regulated and hyper-methylated: HOXB8, SLC25A18, CALCRL, TMEM71, C1orf115, EDG1, CCDC68, NUAK1 and CMTM8. These two types of genes are important for highlighting the effect of methylation (i.e., the epigenetic effect) on gene expression. The supplementary materials are available at: https://www.dropbox.com/sh/gsvgty85jdlp2b6/13fOZxJV8n.

Conclusion
In this article, we have proposed a computational rule mining framework to determine a special type of association rules and potential biomarkers from gene expression as well as methylation data, using integrated approaches of statistical techniques and BiMax biclustering. At first, different statistical techniques (viz., removal of genes having low variance, normalization, normality test, and different parametric and non-parametric tests) are applied to the whole dataset to obtain a proper non-redundant/highly significant subset of differentially expressed/methylated genes. The resulting subset of genes is used in the next step. Thereafter, the data is discretized and post-discretized, consecutively. The biclustering technique is then utilized to determine MFCHOIs, and the special rules are generated from the MFCHOIs. Our proposed rule mining method performs much better than the state-of-the-art rule mining algorithms as it generates MFCHOIs instead of FIs; therefore, it saves running time and is able to work on big datasets. Pathway and GO analyses have been performed on the genes of the evolved rules using the DAVID database. The occurrence of each gene in the evolved rules of each class-label is determined for identifying the potential biomarkers. Furthermore, we have also classified the data to assess how accurately the evolved rules describe the remaining test (unknown) data. Subsequently, we have compared the average classification accuracy and other related factors of our proposed method with those of the other existing rule-based classifiers. Statistical significance tests are also utilized for checking the statistical relevance of the comparisons. Here, each of the other rule mining methods or rule-based classifiers also starts with the same post-discretized data-matrix.
At the end, we have also performed the integrated analysis of the gene expression data and the methylation data for highlighting the epigenetic effect (viz., the effect of methylation) on the gene expression level.