Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

Analyzing Large Gene Expression and Methylation Data Profiles Using StatBicRM: Statistical Biclustering-Based Rule Mining

  • Ujjwal Maulik, 
  • Saurav Mallik, 
  • Anirban Mukhopadhyay, 
  • Sanghamitra Bandyopadhyay
PLOS
x

Abstract

Microarray and beadchip are two most efficient techniques for measuring gene expression and methylation data in bioinformatics. Biclustering deals with the simultaneous clustering of genes and samples. In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special type of rules and potential biomarkers using integrated approaches of statistical and binary inclusion-maximal biclustering techniques from the biological datasets. At first, a novel statistical strategy has been utilized to eliminate the insignificant/low-significant/redundant genes in such way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). The data is then discretized and post-discretized, consecutively. Thereafter, the biclustering technique is applied to identify maximal frequent closed homogeneous itemsets. Corresponding special type of rules are then extracted from the selected itemsets. Our proposed rule mining method performs better than the other rule mining algorithms as it generates maximal frequent closed homogeneous itemsets instead of frequent itemsets. Thus, it saves elapsed time, and can work on big dataset. Pathway and Gene Ontology analyses are conducted on the genes of the evolved rules using David database. Frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers. Furthermore, we also classify the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we also compare the average classification accuracy, and other related factors with other rule-based classifiers. Statistical significance tests are also performed for verifying the statistical relevance of the comparative results. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. Finally, we have also included the integrated analysis of gene expression and methylation for determining epigenetic effect (viz., effect of methylation) on gene expression level.

Introduction

Microarray technique is a useful tool for measuring gene expression data across different experimental and control samples. Similarly, beadchip is another efficient technique for generating genome-wide DNA methylation profiling in infinium II platform. DNA methylation is an important epigenetic factor that refers to the addition of a methyl group (-CH3) to position 5 of the cytosine pyrimidine ring or the number 6 nitrogen of the adenine purine ring in genomic DNA. It modifies, in general decreases, the expression levels of genes. Both the expression and methylation data matrix [1], [2], [3], [4] are initially organized in such a way that rows and columns indicate genes and samples (conditions), respectively. Statistical analysis [5], [6], [7] is an important tool to identify differential expression/methylation (i.e., DE/DM) genes across different types of samples.

Association rule mining (ARM) [8], [9] is another useful tool for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). In this article, we propose a computational rule mining framework, StatBicRM (i.e., statistical biclustering-based rule mining) to identify special rules of genes and potential biomarkers from the large gene expression and/or methylation data by integrating a novel statistical technique and binary inclusion-maximal biclustering technique, consecutively.

In traditional association rule mining algorithms, huge number of rules is coming out as result. Thus, it is difficult to run them on medium or large sized dataset in which the number of genes is approximately 250 or more. To solve the problem, in our proposed method, we have utilized the binary inclusion-maximal biclustering (i.e., BiMax) technique [10] for mining non-redundant significant itemsets and corresponding special rules. But, the biclustering technique can work on such dataset whose the number of genes is less than equal to 10,000 approximately. If the number is greater than 10,000, it fails to work. Thus, for such large dataset, we have to apply a statistical strategy on the dataset before using the biclustering technique to eliminate the redundant/insignificant/low-significant genes in such way that significance level must rely on the data distribution property (viz., either normal distribution or non-normal distribution). Therefore, first of all, the whole data is passed through different fundamental statistical techniques (viz., removal of genes having low variance and normalization, consecutively).

Now, if there is large number of samples in a dataset, there is no need to use any normality test on the data before using any statistical test as all statistical tests perform more or less well for the large number of samples. But, if a dataset has small number of samples, then it has been observed that different statistical tests perform differently [11]. It is well-known that standard t-test, Welch’s t-test, Bayes t-test and Pearson’s correlation test are all parametric statistical tests, and Limma, significant analysis of microarrays (SAM), Wilcoxon’s ranksum test and permuted t-test are considered as non-parametric tests. Some non-parametric statistical tests (like Limma and SAM) are good performers for normally distributed data as well as non-normally distributed data in all conditions, specially for small sample sizes. But, the performance of SAM is found to be inconsistent as sometimes it produces good performance while at other times it fails to work properly for small sample sizes. The performance of permuted t-test is satisfactory in case of non-normal distributions for all types of sample sizes. But, in case of normal distributions, it works poorly especially for small sample sizes. For normally distributed data, the performance of Wilcoxon’s ranksum test is much poorer than standard t-test for small sample. On the other hand, for small sample sizes, the standard t-test produces poor performance for non-normally distributed data where its performance is better in normally distributed data. Performance of Welch’s t-test is poor for both the cases of data distributions for small number of samples. To summarize, it can be stated that in case of small number of samples, it is better to test the data distribution in advance [11]. Otherwise p-values may be misleading due to the assumption of incorrect distribution. Therefore, we have initially used a well-known normality test (i.e., Jarque-Bera test [12]) for testing the distribution pattern of each data whether the data is normally distributed or not. Depending on the patterns, the dataset is then partitioned into two sub-datasets, where one sub-dataset has all normally distributed data, and remaining one contains all non-normally distributed data. Now, it is noticed that the parametric tests perform better for normally distributed data than for non-normally distributed data on average. On the other hand, the performance of non-parametric tests is more satisfactory for non-normally distributed data than for normally distributed data on average [11]. Therefore, after testing for normality, we have run multiple parametric statistical tests (viz., t-test [11], Welch’s t-test [11], modified Bayes’ t-test by Fox and Dimmic [13], and Pearson’s correlation test (Corr) [11]) on the normally distributed data to identify differentially expressed/methylated genes and taken their intersection in order to be certain that whichever genes are identified, are truly differentially expressed/methylated. Similarly, we have applied multiple non-parametric tests (viz., Limma [11], significant analysis of microarrays (SAM) [11], Wilcoxon’s ranksum test (Wcox) [11], and permuted t-test (Perm) [11]) on non-normally distributed data to obtain differentially expressed/methylated genes, and taken their intersection in order to be certain that whichever genes are identified, are truly differentially expressed/methylated. A list is then prepared containing the resulting intersected genes from both the normally distributed dataset and the non-normally distributed dataset. These statistical methods are utilized to determine the proper significant non-redundant subset of the differentially expressed/methylated genes from the original large dataset. Thereafter, discretization and post-discretization are utilized consecutively on the subset of data for converting it into corresponding boolean matrix.

Now, our next major goal is rule mining. For this purpose, the biclustering technique is directly applied on the post-discretized data-matrix for determining maximal homogeneous biclusters of genes as maximal frequent closed homogeneous itemsets (viz., MFCHOIs) at a minimum support-value. Here, MFCHOI means the maximal biclusters that have sets of all homogeneous class-labels/samples. The rules are then extracted from the MFCHOIs. Each evolved rule is of special type, i.e., consequent of the rule consists of its class-label only. Therefore, each MFCHOI produces a single special rule. Our proposed methodology performs better than state-of-the-art rule mining algorithms as it generates maximal frequent closed homogeneous itemsets (viz., MFCHOIs) instead of frequent itemsets. Each of the other rule mining methods is also starting with the same post-discretized data-matrix. Another advantage of it that as these rules are classification rules, so we do not need to calculate any other rule-interestingness measure (e.g. confidence) except support. Therefore, it saves elapsed time and can work on big data in which number of genes is high. Pathway and Gene Ontology (GO) analysis are conducted on the genes of the evolved rules using David database. Furthermore, frequency analysis of the genes appearing in the evolved rules is performed to determine potential biomarkers.

Furthermore, it is also needed to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. For this, we need to perform cross-validation and classification, consecutively on the data to compute average accuracy of the proposed method. Therefore, the earlier mentioned post-discretized data-matrix is divided into training and test sets using 4-fold cross-validations (CVs). Thereafter, the biclustering technique is applied on the training part of the dataset for determining MFCHOIs at a minimum support-value. The special rules are then extracted from the MFCHOIs. Here, each MFCHOI generates a single rule. We have also estimated 23 rule-interestingness measures [14], [15] of the evolved rules. We have also added another new measure (viz., the number of satisfiable conditions/samples of the corresponding bicluster for each evolved rule). We have estimated the rank of each rule according each of the 24 measures individually using Fractional ranking [1618]. The final ranking of each evolved rule is calculated by average ranking on the resulting fractional rankings of the rule. All the rules are rearranged from best to worst case. We have then assigned some weight on the final list of rules in such a way that the topmost rule gets the highest weight, 2nd topper gets 2nd highest weight and so on; and also the weight-interval between any two consecutive ranked rules is same. The classification technique is applied on each test data point using a majority voting technique through weighted-sum method. A comparative performance study with existing popular rule-based classifiers is conducted based on the average classification accuracy, MCC and related factors of them. Our classification method provides better performance than the existing popular rule-based classifiers. Here, each of the other rule-based classifiers is also starting with the same post-discretized data-matrix. Statistical significance tests (viz., one-way ANOVA) [19] are also performed for verifying the statistically relevance of the comparative results.

As we have mentioned earlier that DNA methylation is one of the important epigenetic factors which can change (generally decrease) the expression levels of genes, therefore we have also performed integrative analysis of gene expression dataset and methylation dataset of combined dataset. As we know that the gene expression is inversely proportional to the methylation, so inversely correlated genes make sense to highlight the epigenetic effect (e.g., methylation) on the expression level. Therefore, we have identified these type of genes having inverse relationship between their methylation and expression levels.

The rest of the article is organized as follows. In Section Materials and Methods, literature review and our proposed methodology have been elaborated. Section Results and Discussion presents source and brief description about the real datasets, and the experimental results and discussion. Finally, Section Conclusion concludes the article.

Materials and Methods

Literature Review

Association rule mining (ARM) [8], [9] is one of the useful tools for determining interesting (expression/methylation) relationships among items (genes) under different conditions (samples). It can provide association rules based on frequent itemsets. A rule (R) can be described as AC, where A, CIM and AC = ϕ. Here, A and C are called as antecedent (i.e., set of items in LHS of a rule) and consequent (i.e., set of items in RHS of a rule), respectively. The support of the itemset (IM) is defined as number of transactions in which all items of it appear together. IM is frequent when its support is greater than any threshold value (i.e., minimum support). The confidence of the rule is defined as ratio of support of IM to the support of A. Frequent closed itemset (FCI) is a condensed form of frequent itemsets. FCI is used to avoid redundancy.

In past decades, traditional Apriori algorithm [20] was most fundamental association rule mining technique. Apriori uses a bottom-up technique in which frequent subsets are extended one item at a time for determining each candidate itemset. Groups of the candidate itemsets are then tested in the data. The method terminates if no further successful extension is found. The result of Apriori is the sets of rules which determine the occurrence of items in the dataset. Apriori follows breadth-first search for counting the candidate itemsets. Apriori generates candidate itemsets having length k from the itemsets having length k − 1. It discards infrequent candidate itemsets. The set of candidate itemsets have all frequent itemsets. After extracting all frequent itemsets, corresponding set of rules is mined from each frequent itemset. As Apriori generates only frequent itemsets, thus huge number of rules are produced from the itemsets. Therefore, Apriori can not run on medium or large size of data. It can hardly work up to 100 genes, approximately. But, if there is more than 100 genes, then it either takes a long time or fails to run.

After further investigations, different shortcomings have been identified in the traditional Apriori, like production of high number of frequent itemsets, high running time, problem of multiple-scan of the dataset etc. Many other ARM techniques have been proposed (e.g., AprioriTid [21], Eclat [22], Tao et al. [23], H-mine [24] etc.) to reduce these shortcomings. But, for medium or large sized dataset (i.e., whose the number of genes is greater than 250 approximately), either the methods fails to work on the dataset or they take a long time (viz., approximately 5 hours or more).

For solving the above limitation, in this article, we have used the BiMax biclustering technique [10] for extracting maximal frequent closed homogeneous itemsets (MFCHOIs) and corresponding special rules. It is a method for identifying groups of all-1 biclusters from a boolean data matrix under certain conditions. The aim of the biclustering is to discover groups of genes (i.e., all-1 biclusters) having similar behaviour under a subset of conditions (samples). The biclustering technique extracts the maximal frequent closed homogeneous itemsets (viz., MFCHOIs) which are proper subsets of frequent itemsets (FIs); i.e., MFCHOIFI. Thus, Our proposed method produces much less number of significant non-redundant itemsets than the other rule mining algorithms. But, the biclustering technique can work on the dataset in which the number of genes is less than equal to 10,000 approximately. If the number is greater than 10,000, it can not work on the dataset. Therefore, for the large dataset, we need to utilize some statistical strategy on the dataset before applying the biclustering technique for eliminating the redundant/insignificant/low-significant genes in such a way that significance level must satisfy the data distribution property (viz., either normal distribution or non-normal distribution). Hence, here, we have proposed a computational rule mining framework, StatBicRM for producing special rules of genes, and potential biomarkers from the large gene expression and/or methylation dataset by integrating a novel statistical technique and the biclustering technique, consecutively.

Proposed Method

Our proposed technique, StatBicRM is basically a computational framework for rule mining where integrated approach of statistical and binary inclusion-maximal biclustering techniques are utilized in gene expression or methylation dataset (see Fig. 1). Besides this, we have also performed classification using the proposed method to know how much the evolved rules are able to describe accurately the remaining test (unknown) data (see Fig. 2).

thumbnail
Fig 1. Flowchart of the proposed methodology (StatBicRM) for the rule mining.

Here, the terms TOTALDESETN, TOTALDESETNN, TOTALDESETN+NN are described in last paragraph of subsection “Identification of differentially expressed/methylated genes using Statistical tests”. For methylation dataset, the above terms are replaced by TOTALDMSETN, TOTALDMSETNN, TOTALDMSETN+NN, respectively.

https://doi.org/10.1371/journal.pone.0119448.g001

thumbnail
Fig 2. Flowchart of the proposed methodology (StatBicRM) for the classification.

Here, the terms TOTALDESETN, TOTALDESETNN, TOTALDESETN+NN are described in last paragraph of subsection. For methylation dataset, the above terms are replaced by TOTALDMSETN, TOTALDMSETNN, TOTALDMSETN+NN, respectively.

https://doi.org/10.1371/journal.pone.0119448.g002

The steps of StatBicRM is described briefly in the following steps:

Identification of differentially expressed/methylated genes using Statistical tests.

Our proposed method basically depends on statistical analysis. As we know that in case of big gene expression/methylation dataset, there may exist 10,000 or more genes. Among them, most of the genes are non-differentially expressed/methylated (i.e., nDE/nDM), and only some of them are differentially expressed/methylated (i.e., DE/DM). When a rule is generated, then these two types of genes may occur together in the rule. According to biological scenario, DE/DM genes can only make sense in a rule relating to specific disease, where the other type of genes is irrelevant to the disease. Therefore, we have initially used a novel statistical strategy on the dataset to identify the set of statistically significant non-redundant DE/DM genes in such way that significance level must rely on the data distribution property (viz., either normal distribution or non-normal distribution). For doing this, at first, the genes which have low variance are eliminated from the gene expression/methylation dataset. Thereafter, we have used zero-mean normalization on the data of these genes to adjust the values measured on different scales to a common scale. The zero-mean normalization can be stated as: (1) where μ and σ denote mean and standard deviation of the expression/methylation data of a gene i before normalization respectively; and vij and vij refer to the value of i-th gene at j-th condition before and after normalization, respectively.

It is well known that the parametric statistical tests [25] are appropriate for normally distributed data, and non-parametric statistical tests [25] are appropriate for non-normally distributed data, respectively. Therefore, Jarque-Bera normality test [12], [26] is utilized on the normalized data to determine the pattern of distribution of the data whether it is normally distributed or non-normally distributed. The Jarque-Bera normality test is defined as follows: (2) where d denotes the degree of freedom, S is the skewness of the sample, and K refers to the kurtosis of the sample. Hence, depending on the resulting distribution patterns, the whole normalized dataset is partitioned into two sub-datasets, where one sub-dataset has all normally distributed data, and remaining one contains all non-normally distributed data.

Thereafter, we have applied four parametric statistical tests (viz., t-test [11], Welch’s t-test [11], modified Bayes’ t-test by Fox and Dimmic [13] in 2006 and Pearson’s correlation test (Corr) [11]) on the normally distributed data to obtain differentially expressed genes for the normally distributed sub-dataset. Similarly, four non-parametric tests (viz., Limma [11], Significant analysis of microarrays (SAM) [11], Wilcoxon ranksum test (Wcox) [11] and permute t-test (Perm) [11]) are applied on the non-normally distributed data to obtain differentially expressed genes for the non-normally distributed sub-dataset.

Before further proceeding, we have shortly discussed in the followings about some of the statistical tests mentioned above.

The “2-sample t-test” makes comparison between means of the two groups with the variation in the data. From the test statistic, we compute a measure (i.e., p-value). The p-value indicates the probability of observing a t-value as large or larger than the actually observed t-value where the null hypothesis is given true. By convention, if the p-value of a gene (item) is less than 5%, then the gene is statistically called as differentially expressed/methylated gene. Now, suppose, for each gene g, group 1: n1 treated samples, with mean x1g and standard deviation s1g; and group 2: n1 controlled samples, with mean x2g and standard deviation s2g. (3) Here, seg denotes the standard error of the groups’ mean, thus, (4) where sPooled is the pooled estimate of the population standard deviation; i.e., (5) Here, df is degree of freedom of the test. It is stated as df = (n1 + n2 − 2). This strategy is used assuming that variance of two groups are equal.

For Welch’s t-test, the variance of two groups are checked whether they are equal to each other or not. If equal, then use earlier mentioned t-statistic in Equation 3, otherwise use the following t-statistic: (6) Here we use unpooled estimates of the population standard deviations.

Pearson’s correlation coefficient (commonly denoted as ρ) between two variables is described as the covariance of the two variables divided by the product of their standard deviations, i.e., (7) where (8) where samplesize for the two groups are n1 and n2, respectively (here, n1 = n2). This test can predict whether two variables are related or not.

The moderated t-statistic in Limma [27] can be demonstrated as: (9) where samplesize n = n1 + n2, β^g and s˜g2 denote the contrast estimator and posterior sample variance for the gene g respectively. The statistic for calculating contrast estimator for gene g is: (10) where, N is normal distribution, and the statistic for estimating posterior sample variance for the gene g is: (11) Where, d0 (< ∞) and s02 refer to the prior degrees of freedom and variance respectively, and dg (> 0) and sg2 denote the experimental degrees of freedom and the sample variance of a particular gene g, respectively.

SAM chooses to add a small positive constant s0 (stated as “fudge factor”) to solve small variance problem. The SAM statistic by Tusher et al.(2001) is: (12) where seg is the standard error of the groups’ mean (see Equation 4). Here, sPooled is the pooled estimate of the population standard deviation (see Equation 5). Here, df is degree of freedom of the test. It is stated as df = (n1 + n2 − 2).

In Wilcoxon ranksum test, a list of ranks of the gene expression values for each gene is prepared in ascending for each group, and then tests for equality of means of the two ranked samples. The z-statistic of the test is: (13) where (14) (15) and (16)

A permuted t-test is a kind of t-test in which an rearrangement is conducted in the labels on the observed data-points of each gene (item).

However, as stated earlier that the four parametric statistical tests are applied on the normally distributed dataset, thus different number of up-regulated and down-regulated genes are coming out from the different parametric tests. Thereafter, we have performed intersection of the up-regulated genes to identify set of common up-regulated genes (denoted by UPDESETN) for the normally distributed sub-dataset. Similarly, we have got set of common down-regulated genes (denoted by DOWNDESETN). We have then made a list (denoted by TOTALDESETN) containing all the common up-regulated genes and all the common down-regulated genes; i.e., TOTALDESETN = UPDESETN + DOWNDESETN. Similarly, as stated earlier that the four non-parametric statistical tests are applied on the non-normally distributed dataset, thus different number of up-regulated and down-regulated genes are coming out from the different non-parametric tests. Then, we have made intersection of the up-regulated genes to identify set of common up-regulated genes (denoted by UPDESETNN) for the non-normally distributed sub-dataset. Similarly, we have got set of common down-regulated genes (denoted by DOWNDESETNN). We have then made another list (denoted by TOTALDESETNN) containing all the common up-regulated genes and all the common down-regulated genes; i.e., TOTALDESETNN = UPDESETNN + DOWNDESETNN.

Finally, we have produced a final list (denoted by TOTALDESET(N+NN)) containing all the common up-regulated and down-regulated genes from the normally distributed and non-normally distributed datasets; i.e., TOTALDESET(N+NN) = TOTALDESETN + TOTALDESETNN. Hence, the final list of genes (i.e., TOTALDESET(N+NN)) are utilized in the next step. Similar steps are performed to obtain HYPERDMSETN, HYPODMSETN, TOTALDMSETN, HYPERDMSETNN, HYPODMSETNN, TOTALDMSETNN and TOTALDMSET(N+NN) instead of UPDESETN, DOWNDESETN, TOTALDESETN, UPDESETNN, DOWNDESETNN, TOTALDESETNN, TOTALDESET(N+NN) respectively.

Discretization and Post-discretization.

Suppose, the data matrix of the resulting list of genes, TOTALDESET(N+NN) is denoted by I. Now, first of all, I whose rows denote genes and columns denote samples, is transposed. Suppose, PIT is the transposed matrix. As the PIT matrix is already normalized by zero-mean normalization, therefore the following step is utilized for binary discretization of the matrix: (17) where PITij denotes the expression/methylation value of i-th row and j-th column (1 ≤ im, 1 ≤ jn), m and n are number of rows (samples) and number of columns (genes) of PITij matrix, respectively, and IT is the resulting discretized matrix. Now, let us assume that in the discretized boolean matrix, a up-regulated gene and a down-regulated gene are denoted by DEup and DEdown, respectively. In the matrix IT, ‘1’ and ‘0’ refer to presence of up-regulated gene (DEup), and presence of down-regulated gene (DEdown), respectively (see part (b) of Fig. 3). After discretization, we will apply Bimax biclustering for finding all-1 biclusters. As the Bimax biclustering rectifies only ‘1’, not ‘0’, thus we need to do post-discretization in such way where ‘1’ will represent both DEup and DEdown properties. Therefore, after discretization, number of columns is doubled where the first half is a domain for DEup property, and remaining half is another domain for DEdown property (see part (c) of Fig. 3). E.g., the column denoted by g1 in part (b) of Fig. 3 is divided into the two columns denoted by g1+ and g1− in part (c) of Fig. 3, where for the g1+ column, ‘1’ denotes presence of up-regulated gene (DEup) and ‘0’ denotes absence of up-regulated gene (∼ DEup), and for the g1− column, ‘1’ denotes presence of down-regulated gene (DEdown) and ‘0’ denotes absence of down-regulated gene (∼ DEdown). Hence, for methylation data, DMhyper and DMhypo are used instead of DEup and DEdown, respectively. Note that in this paper, ‘+’ and ‘-’ denote up-regulation/hyper-methylation and down-regulation/hypo-methylation, respectively.

thumbnail
Fig 3. An example of generating special rules from data matrix of the differentially expressed genes.

Here, up-regulation (i.e., ‘+’) and down-regulation (‘-’) are denoted by ‘1’ and ‘0’ in (b), and red and green colors in (c), respectively. Here, str and snr denote experimental/diseased/treated and control/normal samples respectively.

https://doi.org/10.1371/journal.pone.0119448.g003

Dividing whole data into training and test sets.

Let us assume that the post-discretized matrix is denoted by ITb. For classification of the matrix ITb, we have applied 4-fold cross-validations (CVs) on the matrix to divide it into test and training data, where one-fold of ITb will be used as test set and remaining three fold will be considered as training set. This procedure will be repeated for four times as it is 4-fold CV.

Finding maximal biclusters and extracting special rules.

We have transposed the training boolean dataset, and applied Bimax biclustering to identify maximal frequent closed homogeneous itemsets (MFCHOIs). Before further proceeding, the fundamental method of BiMax biclustering is discussed in short.

Suppose, a boolean matrix e has size of n × m, where n is number of genes and m is the number of samples. A cell eij is 1 if gene i expresses differentially in the sample/condition j and otherwise, eij is 0. A bicluster (G, S) is a subset of genes G ⊆ {1, 2, …, n} which express differently together under a subset of samples S ⊆ {1, 2, …, m}; i.e., the pair (G, S) refers to a subset of the matrix e whose all elements have 1. The biclusters which are inclusion-maximal (i.e., the biclusters that are not entirely part of any other bicluster), are only interesting. The pair (G, S) ∈ 2{1,2,…,n} × 2{1,2,…,m} can be stated as a bicluster of the type inclusion-maximal [10] if and only if (i) eij = 1, ∀iεG, jεS, and (ii) ∄ (G′, S′) ∈ 2{1,2,…,n} × 2{1,2,…,m} with (a) eij = 1, ∀i′ ∈ G′, j′ ∈ S′ and (b) GG′∧SS′∧(G′, S′) ≠ (G, S). When there is no proper superset of an itemset have been found at the same support value, then the itemset is called closed itemset. Finding the set of frequent itemsets is totally equivalent to get a set of all-1 biclusters each having at least number of conditions/samples (i.e., satisfying minimum support).

In our experiment, for the biclustering, we have set a fixed minimum cutoff of items/genes (viz., 2), and different minimum cutoffs of sample/condition for determining itemsets at different minimum support of each rule. The BiMax biclustering can generate all maximal biclusters. The items/genes of maximal biclusters represent a (maximal) closed itemset. Thus, all extracted biclusters that are satisfying minimum support condition, produce the set of (maximal) frequent closed itemsets with their class-labels (i.e., conditions/types of samples). Thereby, we have to filter the biclusters depending on their conditions. We have selected such (maximal) biclusters which have the group of homogeneous (non-contradictory) conditions. In other words, we have to identify maximal frequent closed homogeneous itemsets (MFCHOIs). E.g., Bicluster 1 is a MFCHOI which has three genes g1+, g2− and g3+, and two homogeneous conditions/samples str1 and str3 (presented in part (d) of Fig. 3). Similarly, Bicluster 2 is another MFCHOI which has three genes g1+, g2+ and g3−, and two homogeneous conditions/samples snr1 and snr3 (presented in part (e) of Fig. 3). Hence, we have omitted the biclusters that have the group of heterogeneous (contradictory) conditions. E.g., Bicluster 3 is such type of heterogeneous (contradictory) bicluster which has three genes g1−, g2+ and g3+, and two heterogeneous conditions str2 and snr2 (presented in part (f) of Fig. 3).

From each selected bicluster of genes, we can extract an association rule. Each resulting rule must be of special type, i.e., consequent of the rule consists of its class-label (i.e., either treated/diseased/experimental class-label or normal/control class-label) only. E.g., from the Bicluster 1 (depicted in part (d) of Fig. 3), rule id 1 (i.e., {g1+, g2−, g3+} ⇒ disease) is produced. It states that if both of gene1 and gene3 are up-regulated/hyper-methylated and gene2 is down-regulated/hypo-methylated simultaneously, then ‘disease’ occurs (see part (d) and part (g).(i) of Fig. 3). Similarly, rule id 2 (i.e., {g1+, g2+, g3−} ⇒ normal) is generated from the Bicluster 2 (see part (e) and part (g).(ii) of Fig. 3).

Ranking of rules.

We have evaluated each evolved rule based on 24 rule-interestingness measures. Support, confidence, coverage, prevalance, sensitivity (or, recall), specificity, accuracy, lift (or, interest), leverage, added value, relative risk, Jaccard, Yules’ Q, klosgen, Laplace correction, Gini index, two-way support, linear correlation coefficient (or, ϕ-coefficient), cosine, least contradiction, Zhang, liverage2 (or, Piatetsky-Shapiro) and kappa [14, 15] are already included among the 24 measures (see S1 Text). The last and novel measure is the number of satisfiable conditions/samples to each evolved rule. E.g., according to part (g).(i) of Fig. 3, the value of the measure of rule id 1 (i.e., {g1+, g2−, g3+} ⇒ disease) is 2 as its corresponding bicluster (in part (d) of Fig. 3) has two conditions (str1 and str3). Similarly, according to part (g).(ii) of Fig. 3, the value of the measure of rule id 2 (i.e., {g1+, g2+, g3−} ⇒ normal) is 2. The rank of the rule is proportional to the value of the measure of it (i.e., if a rule that has higher value of the measure than other rule, then the rank of the rule will be better than the second rule).

Thereafter, the evolved rules are ranked according to each of the 24 rule-interestingness measures individually using Fractional ranking [1618]. In the fraction ranking, items which compare equal, hold the same rank. This rank is the mean of ranking numbers that are received in ordinal ranking. E.g., suppose, a data set is {1 2 2}. Here, only two different numbers are available, so there should be two different ranks. If 2 and 2 are actually different numbers, then they should hold ranks 2 and 3, respectively. As these two numbers are same, thus we should calculate their rank by making the average of their ranks as follows: (2+3)/2 = 2.5; therefore, the fractional ranks will be: 1 2.5 2.5.

Hence, the final ranking of each rule is determined by average ranking on the resulting fractional rankings of the rule. The rules are then rearranged in ascending order (i.e., from best to worst rank).

Assigning weights to the rules w.r.t. their final ranking.

For classification, we have to apply a majority voting technique on each test data point to identify its class-label through weighted-sum method. Thus, we firstly assign some weight on the final list of rules in such a way that the topmost rule gets the highest weight, 2nd topper gets 2nd highest weight and so on. The weight of the first ranked rule is always 1. The ranges of weight lie in between 0 and 1. The weight of each rule (denoted by wj, 1 ≤ jp) is estimated from a function of its final rank of the rule (denoted by rj) and the total number of rules (viz., p) as described below: (18) Here, the weight-interval between any two consecutive ranked rules is same. Thus, the calculated weights of the rules are normalized using zero-mean normalization (in Equation 1).

Majority voting and classification.

Consider one test data point (ts). For determining of the predicted class-label of it, we have applied ‘majority voting’ technique. At first, we have identified the rules whose all the items (genes) in their antecedent sides exist in ts. The weights of the rules (i.e., trRts number of rules) which have only the class-label ‘disease’ in their consequent sides are then summed up (viz., Ws_trts). Similar summation (viz., Ws_nrts) is performed for the rules (i.e., nrRts number of rules) having only the class-label ‘normal’ in their consequent sides. The two weighted-sum are then compared, and the class-label with higher weighted-sum becomes the predicted class-label of ts (viz., PredClsts). But, if both the weighted-sum are equal to each other, then the class-label of the top rule which satisfies ts (i.e., ClsTopRts), becomes the predicted class-label of it. In case, if there is no such rule which satisfies that ts, then the class-label of the rule which satisfies maximum number of test points (i.e., ClsR) becomes the predicted class-label of it. An example of majority voting technique is presented in Fig. 4. Repeat this step for other test points. This process is then repeated 4 times for 4 sub-matrices of the test data as here 4-fold CV is used. Using this technique, we have calculated true positive (TP), true negative (TN), false positive (FP), false negative (FN), sensitivity [7], specificity [7], accuracy [7] and Mathews correlation coefficient (MCC) [7] for the proposed classification. Sensitivity, specificity, accuracy and MCC are defined in the followings, respectively: (19) (20) (21) (22) we have repeated the 4-fold CV for 10 times, and then their average sensitivity, average specificity, average accuracy, and average MCC are calculated with the standard deviations based on the results of cross-validations. Thereafter, a comparative performance analysis has been conducted between our proposed method (i.e., Prop) and other popular rule-based classifiers (i.e., ConjunctiveRule (CJR) [28], DecisionTable (DT) [28], JRip [28], OneR [28], PART [28] and Ridor [28] implemented in Weka 3.6 software) based on their average sensitivity, average specificity, average accuracy, and average MCC. Note that the other rule-based classifiers are also started with the same post-discretized matrix (i.e., ITb). We have also performed significance test (viz., One-way Anova) on the accuracies of the classifiers pairwise to know the level of significance (i.e., p-value) of the test for each (pairwise) comparison.

thumbnail
Fig 4. An example of classification of evolved rules by the majority voting using weighted-sum.

Here, ‘r’ and ‘w’ denote rank and weight of the rule (computed by Equation 18), respectively. Tickmark/crossmark in ‘Q’ column states that test-point (ts) is satisfied/non-satisfied by the corresponding rule.

https://doi.org/10.1371/journal.pone.0119448.g004

The above steps have been described the proposed methodology for the classification.

Performance comparison with other rule mining algorithms.

For the purpose of rule mining only, the whole post-discretized data matrix (i.e., ITb) is used directly as a input of the Bimax biclustering. In this case, we have not performed any cross-validation since there is no need to use classification. Hence, we have compared our proposed rule mining algorithms with the other existing popular rule mining algorithms (i.e., AprioriTid [21], Eclat [22], Tao et al. [23] and H-mine [24]). It should be noted that same input binary matrix (i.e., ITb) is utilized for the other rule mining algorithms.

Biological significance of evolved rules, and Biomarker identification.

As all the real datasets are microarray/beadchip (biological) datasets, so the evolved top rules should have biological significance. The information about the relation between the genes and any disease can be determined from pathway and Gene Ontology (GO) analyses. If all the genes (except the class-label) of a rule occur together in any pathway/GO-term and if the occurrence is statistically significant (i.e., p-value is less than 0.05), then the rule becomes the (statistically) biologically significant rule. If the pathway/GO-term relates to the corresponding disease, then the rule becomes important for diagnosing the disease. Therefore, KEGG pathway and GO analyses have been performed on the genes of the evolved rules using David Database to identify top significant rules with their involved KEGG pathways or GO-terms. The top rules occupied in a significant number of pathways/GOs are obtained. Frequency of occurrence of the genes in the evolved rules for experimental/treated class is performed to identify potential biomarkers.

Results and Discussion

In this section, at first, we describe the real datasets that are utilized to verify the performance of our proposed method (i.e., StatBicRM). Thereafter, we have performed experiments on the real datasets as well as some artificial datasets. The artificial datasets are made by taking random boolean values. Hence, some related discussions are also included at the end of this section.

Real Datasets

We have used three real datasets. The datasets are described in Table 1.

Experimental Results and Discussion

As stated in section Materials and Methods, we have initially applied the statistical filtering strategy (viz., removal of genes having low variance, normalization, normality test, different parametric/non-parametric tests, consecutively) on the (three) real datasets. For the combined dataset (GSE31699), we have firstly identified 13072 common genes that have both expression data (in Dataset 2) as well as methylation data (in Dataset 3). Thereafter, we have found that 18228, 10236 and 8176 genes are following normal distribution, and 24222, 2836 and 4896 genes don’t follow normal distribution for the three datasets, respectively. For DS1, we have identified a total of 1160 differentially expressed genes (viz., ∣TOTALDESET(N+NN)∣ = 1160) including ∣UPDESETN∣ = 344, ∣UPDESETNN∣ = 93, ∣DOWNDESETN∣ = 403 and ∣DOWNDESETNN∣ = 320 at 0.0001 p-value cutoff. Table 2 shows this. Similarly, for DS2, we have determined a total of 292 differentially expressed genes (viz., ∣TOTALDESET(N+NN)∣ = 292) including ∣UPDESETN∣ = 54, ∣UPDESETNN∣ = 86, ∣DOWNDESETN∣ = 82 and ∣DOWNDESETNN∣ = 70 at 0.05 p-value cutoff. Table 3 represents this. For DS3, we have identified a total of 536 differentially methylated genes (viz., ∣TOTALDMSET(N+NN)∣ = 536) including ∣HYPERDMSETN∣ = 74, ∣HYPERDMSETNN∣ = 181, ∣HYPODMSETN∣ = 118 and ∣HYPODMSETNN∣ = 163 at 0.05 p-value cutoff. Table 4 represents this. These genes (i.e., ∣TOTALDESET(N+NN)∣ or ∣TOTALDMSET(N+NN)∣) are used in the next. Hence, clustergram of the differentially expressed genes for Dataset 1 is shown in Fig. 5. A volcanoplot which is used for identifying up-regulated genes as well as down-regulated genes using SAM for Dataset 1, is presented in Fig. 6.

thumbnail
Table 2. Number of differentially expressed genes by different statistical tests for Dataset 1, where #Gup, #Gdw denote up and down-regulated genes, respectively. Here, Pearson’s correlation test can not be used as the number of experimental samples is not equal to the number of control samples.

https://doi.org/10.1371/journal.pone.0119448.t002

thumbnail
Table 3. Number of differentially expressed genes by different statistical tests for Dataset 2, where #Gup, #Gdw denote up and down-regulated genes, respectively.

https://doi.org/10.1371/journal.pone.0119448.t003

thumbnail
Table 4. Number of differentially methylated genes by different statistical tests for Dataset 3, where #Ghyper and #Ghypo refer to hyper and hypo-methylated genes, respectively.

https://doi.org/10.1371/journal.pone.0119448.t004

thumbnail
Fig 5. The clustergram of the common differentially expressed genes (by different statistical tests) for DS1.

Here, red colour denotes up-regulation of genes across the specific samples/conditions, and green colour denotes down-regulation of genes across the specific samples/conditions.

https://doi.org/10.1371/journal.pone.0119448.g005

thumbnail
Fig 6. Volcanoplot for identifying differential up and down-regulated genes from Dataset 1 by SAM.

https://doi.org/10.1371/journal.pone.0119448.g006

Furthermore, we have prepared an artificial large microarray dataset (denoted by “Dataset 4” or “DS4”) having large number of experimental samples as well as large number of control samples (viz., 100 experimental samples and 100 control samples) and 30000 genes. We have included this dataset into our experiment for validating the performance of classifiers for large number of samples. We have applied our proposed method on the DS4 like DS1, DS2 and DS3. For the DS4, we have obtained that 5348 genes are following normal distribution, and remaining 24462 genes do not follow normal distribution. Thereafter, we have determined a total of 942 differentially expressed genes (viz., ∣TOTALDESET(N+NN)∣ = 942) including ∣UPDESETN∣ = 51, ∣UPDESETNN∣ = 532, ∣DOWNDESETN∣ = 23 and ∣DOWNDESETNN∣ = 336 at 0.05 p-value cutoff. Table 5 shows this.

thumbnail
Table 5. Number of differentially expressed genes by different statistical tests for the artificial Dataset 4, where #Gup, #Gdown denote up-regulated and down-regulated genes, respectively.

https://doi.org/10.1371/journal.pone.0119448.t005

As stated in section Materials and Methods, after statistical testing analysis, we have applied discretization, post-discretization, dividing the data into training and test sets using 4-fold CVs, finding MFCHOIs (i.e., maximal biclusters), and extracting the special rules, consecutively. E.g., a bicluster of a few DEup genes (viz., DISP1, DNALI1, RGN, RPL13P5, FBXO2 etc.) and DEdown genes (viz., KIF11, CENPN, CENPW, DTL, UHRF1, CDCA4 etc.) across a set of two homogeneous conditions (i.e., AC-sample11/AC11 and AC-sample13/AC13) for DS1 is presented in Fig. 7 graphically. After extracting rules, ranking of them, weight assigning, majority voting and two-class classification are performed, consecutively (see section Materials and Methods). In our experiment, we have run 4-fold CVs for 10 times for each dataset. Thereafter, we have calculated average sensitivity, average specificity, average accuracy and average MCC, and finally compared these with the existing rule-based classifiers (presented in Table 6 for DS1, Table 7 for DS2, Table 8 for DS3, and Table 9 for DS4). For each of the four datasets, our proposed method provides the better average accuracy and average MCC than the other six classifiers. It provides 95%, 75.94%, 88.61% and 83.77% accuracies, and 0.88, 0.58, 0.77 and 0.64 MCCs for the four datasets, respectively. For DS1, it provides the best average sensitivity (viz., 99.25%) and average specificity (viz., 85.55%). For DS2, it shows the best average sensitivity (i.e., 98.13%), but its average specificity (i.e., 53.75%) is less than the other classifiers. Similarly, for DS3, it produces the best average sensitivity (i.e., 90.56%), but its average specificity (i.e., 86.67%) is less than some of the four classifiers (viz., PART, JRip, ConjunctiveRule and OneR). For DS4, it produces the best average sensitivity (i.e., 84.13%), but its average specificity (i.e., 83.62%) is less than the “PART” classifier. Fig. 8 shows the comparison of the dataset-wise average accuracies and average MCCs, respectively among our proposed method and the other rule-based classifiers for the four datasets. Thereafter, a statistical significance test (viz., One-way Anova) is performed in between our method and each of the other classifiers pairwise. We have obtained significant p-values for all the pairwise comparisons (i.e., statistical tests) for DS1 and DS2, whereas we have identified that five among the six comparisons are statistically significant for DS3 as well as DS4 (see Fig. 9). The significance level (i.e., p-value) of each comparison is presented in Table 10.

thumbnail
Table 6. Comparative performance analysis of the rule-based classifiers on Dataset 1, respectively (at 4-fold CVs repeating for 10 times); where bold font denotes the highest value for each column.

https://doi.org/10.1371/journal.pone.0119448.t006

thumbnail
Table 7. Comparative performance analysis of the rule-based classifiers on Dataset 2, respectively (at 4-fold CVs repeating for 10 times); where bold font denotes the highest value for each column.

https://doi.org/10.1371/journal.pone.0119448.t007

thumbnail
Table 8. Comparative performance analysis of the rule-based classifiers on Dataset 3, respectively (at 4-fold CVs repeating for 10 times); where bold font denotes the highest value for each column.

https://doi.org/10.1371/journal.pone.0119448.t008

thumbnail
Table 9. Comparative performance analysis of the rule-based classifiers on Dataset 4, respectively (at 4-fold CVs repeating for 10 times); where bold font denotes the highest value for each column.

https://doi.org/10.1371/journal.pone.0119448.t009

thumbnail
Table 10. p-value of Anova 1 between the avg. accuracies of the proposed and other classifiers (pairwise) in DS1, DS2, DS3 and DS4 (where ‘S’ and ‘NS’ refer to significant (p-value ≤ 0.05) and non-significant (p-value > 0.05) p-values respectively).

https://doi.org/10.1371/journal.pone.0119448.t010

thumbnail
Fig 7. A graphical representation of the gene expression of a maximal homogeneous bicluster (i.e., a MFCHOI) over different samples.

https://doi.org/10.1371/journal.pone.0119448.g007

thumbnail
Fig 8. Barcharts: (a) comparison of dataset-wise average accuracies, and (b) comparison of dataset-wise average MCCs, among our proposed and other existing rule-based classifiers for the four datasets.

https://doi.org/10.1371/journal.pone.0119448.g008

thumbnail
Fig 9. Boxplots of significance tests (i.e., one-way Anova) for identifying level of significances (i.e., p-values) of accuracies between the proposed and other rule-based classifiers (pairwise) for Dataset 1 [in (a).(i-vi)], Dataset 2 [in (b).(i-vi)], Dataset 3 [in (c).(i-vi)] and Dataset 4 [in (d).(i-vi)]; where (i) proposed vs ConjunctiveRule, (ii) proposed vs DecisionTable, (iii) proposed vs JRip, (iv) proposed vs OneR, (v) proposed vs PART and (vi) proposed vs Ridor; (here vertical axis denotes the accuracy of the classifier).

https://doi.org/10.1371/journal.pone.0119448.g009

For DS1, we have obtained a set of rules, where some of these are of cancer subtype AC, and remaining of these are of cancer subtype SCC (i.e., these rules have ‘class = AC’/‘class = SCC’ in their consequent parts). We have produced a list of top frequent genes occurring in the evolved rules of each cancer subtype. We have identified ‘CENPA-’ as top frequent gene for cancer subtype AC. But, ‘CENPA-’ is not found in any evolved rule of cancer subtype SCC. Therefore, it is important for cancer subtype AC. Some literature evidences have been found in [29, 30] related to it. (According to the dataset, fold change of CENPA gene is 0.791 which indicates down-regulation of it as this value is less than 1; and p-values of it are 5.00E-05 in SAM, 2.53E-05 in Limma, 7.46E-07 in permuted t-test and 8.42E-06 in Wilcoxon ranksum test which indicate that it is a differentially expressed gene.) Besides that, we have also found literature-match of TTK in [31]. CENPN, KIF2C, EZH2 genes are also found in literatures [32, 33] related to AC. Similarly, ‘SHROOM3-’ is the top frequent gene for SCC. In DS2, we have identified a set of rules, where some of them are of UL class (i.e., tumor class), and remaining rules of them are of MM class (i.e., normal class). We have found some literature documentations about PRL in [34], TRPC6 in [35], and IGF2 in [36]. Similarly, for DS3, we have identified some set of rules, where some of them are of UL class (i.e., tumor class), and remaining rules of them are of MM class (i.e., normal class). The top 10 frequent genes for the two classes for the datasets are shown in Table 11, respectively. Some evidence of PLP1 in forming the tumor in [37] and similar documentary support of TRPM2 in forming the tumor are found in [38]. In Fig. 10 depicts two examples of how significant biomarkers are identified for each class-label for each real dataset.

thumbnail
Table 11. Top 10 frequent genes in evolved rules of the two class-labels for DS1, DS2 and DS3, respectively.

Ruleexperimental and Rulecontrol denote the set of the evolved rules of experimental class-label, and the set of the evolved rules of control class-label, respectively.

https://doi.org/10.1371/journal.pone.0119448.t011

thumbnail
Fig 10. Two examples of how significant biomarkers are identified from the maximal homogeneous biclusters (i.e., MFCHOI) for each class-label for each dataset.

Here, we are shown intersection of only four maximal homogeneous biclusters for (a) the class-label AC and (b) the class-label SCC, individually (for Dataset 1). For the class AC, CENPA-, TTK-, KIF11-, KIF18B- and ZNF367- are the top frequent genes as they exist in the four biclusters (see (a)); similarly, for the class SCC, SHROOM3- is top frequent gene as it exists in the four biclusters (see (b)).

https://doi.org/10.1371/journal.pone.0119448.g010

Besides that, KEGG pathway and GO analysis have been performed on the genes of the evolved rules for the three real datasets using David Database. The significant KEGG pathways, GO:BP, GO:CC and GO:MF terms are presented in Table 12. Some top important rules w.r.t. their existing KEGG pathways, GO:BPs, GO:CCs and GO:MFs, individually are shown in Table 13, Table 14 and Table 15 for the three real datasets (i.e., DS1, DS2 and DS3), respectively. Here, an important rule w.r.t. their existing pathways/GO-terms refers to such a rule whose all the genes (i.e., all genes of consequent part) involve together in maximum number of pathways/GO-terms. The details about the significant KEGG pathways, GO:BPs, GO:CCs and GO:MFs are presented in S1 File, S2 File and S3 File for the three datasets, respectively. S2 Text shows the top 15 rules from the three real datasets.

thumbnail
Table 12. KEGG pathway, GO:BP, GO:CC and GO:MF analysis of corresponding genes of the evolved rules from the three datasets.

Here, ‘satisfiable rule’ or SRule by some KEGG-pathway(i.e., Path)/GO:BP/GO:CC/GO:MF means that all the genes (i.e., antecedent) of the rule are occurred together in the pathway/Go-term.

https://doi.org/10.1371/journal.pone.0119448.t012

thumbnail
Table 13. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:CCs/GO:MFs in Dataset 1.

https://doi.org/10.1371/journal.pone.0119448.t013

thumbnail
Table 14. Some top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs in Dataset 2.

Here, we have got no such significant rule w.r.t. their existing GO:MFs for the dataset.

https://doi.org/10.1371/journal.pone.0119448.t014

thumbnail
Table 15. Top important rules w.r.t. their existing KEGG pathways/GO:BPs/GO:CCs/GO:MFs in Dataset 3.

https://doi.org/10.1371/journal.pone.0119448.t015

Furthermore, we have compared our proposed rule mining algorithm with the existing ARM algorithms (viz., Traditional Apriori, AprioriTid, Eclat, Tao et al. and H-mine). But, since the number of genes of the datasets are high (i.e., greater than 250), and the other techniques generate frequent itemsets, therefore those methods fails to work on DS1 and DS4, and take long time to execute on DS2 and DS3 (i.e., nearly 5 hours or more) whereas our method can work efficiently on them as our method extracts maximal frequent closed homogeneous itemsets (MFCHOIs) using the biclustering technique. Therefore, we have made two artificial binary datasets which are prepared by taking random binary digits (namely 1 or 0). One dataset (denoted by ArDS5) among them has 100 genes and 60 samples (i.e., a total of 30 experimental samples and 30 control samples), and other dataset (denoted by ArDS6) includes 200 genes and 60 samples (i.e., a total of 30 experimental samples and 30 control samples). Thereafter, we have applied the binary matrix as input of the BiMax biclustering. For each of ArDS5 and ArDS6, we have observed that our proposed method produces much less number of significant non-redundant itemsets than the other rule mining algorithms as our proposed method determines the maximal frequent closed homogeneous itemsets (MFCHOIs) which are proper subsets of frequent itemsets (FIs); i.e., MFCHOIFI. Furthermore, as a single rule is extracted from a single significant itemset (i.e., MFCHOI) in our method, thus the number of evolved rules are same with the number of significant itemsets (i.e., MFCHOIs). Therefore, the evolved rules in our method are much less in number than the other methods. Thus, elapsed time of our method is much less than the other methods; and it can work on big/medium data (i.e., the data having more than nearly 250 genes). Fig. 11 presents the comparison of the number of significant non-redundant itemsets among our proposed method (i.e., StatBicRM) and the other methods for details. S3 Text shows the comparative study about the number of evolved rules as well as total elapsed time among our proposed rule mining method and the other rule mining methods for the two artificial binary datasets (i.e., ArDS5 and ArDS6).

thumbnail
Fig 11. Comparison of number of significant itemsets between StatBicRM and other existing ARM methods at different minimum support for the two artificial datasets (viz., ArDS5 and ArDS6).

“Significant itemset” refers to MFCHOI for StatBicRM, and FI for the other methods.

https://doi.org/10.1371/journal.pone.0119448.g011

Integrative analysis of Gene Expression dataset and Methylation dataset

For integrative analysis of gene expression dataset and methylation dataset, we have to consider only matched samples and matched genes from both the datasets (having same dataset Reference ID). Dataset 2 (NCBI Ref. ID:- GSE31699, gene expression dataset) and Dataset 3 (NCBI Ref. ID:- GSE31699, methylation dataset) have 13072 common genes (i.e., combined dataset). For integrative analysis of the combined dataset, at first we have identified 16 matched samples from the combined dataset. Thereafter, we have used the normality test on the matched expression dataset as well as matched methylation dataset individually. For the matched expression dataset, we have identified that 10236 of the matched genes are normally distributed, and rest of them (i.e., 2836 genes) are not normally distributed. For the matched methylation dataset, we have found 8173 genes which are following normal distribution, and remaining 4899 genes that do not following normal distribution. Then, the four parametric tests are applied on the normally distributed genes, and the four non-parametric tests are applied on the non-normally distributed genes, both at 0.05 p-value threshold for the matched expression dataset as well as the matched methylation dataset. We have found 54 common up-regulated genes and 82 common down-regulated genes from the parametric tests, and 86 common up-regulated genes and 70 common down-regulated genes from the non-parametric tests for the expression dataset (viz., ∣UPDESETN∣ = 54, ∣UPDESETNN∣ = 86, ∣DOWNDESETN∣ = 82 and ∣DOWNDESETNN∣ = 70). Thereafter, we merge the common up-regulated and common down-regulated genes collected from the results of the parametric and nonparametric tests for the matched expression dataset (viz., ∣TOTALDESET(N+NN)∣ = 292). Similar statistical analyses are performed for the matched methylation dataset (viz., ∣HYPERDMSETN∣ = 89, ∣HYPERDMSETNN∣ = 174, ∣HYPODMSETN∣ = 95, ∣HYPODMSETNN∣ = 165, and ∣TOTALDMSET(N+NN)∣ = 523). Thereafter, our proposed method is applied on the TOTALDESET(N+NN) genes as well as the TOTALDMSET(N+NN) genes, individually (see S4 Text).

Furthermore, we have concentrated on the internal relationship between the gene expression and methylation. As we know that the gene expression is inversely proportional to the methylation, therefore inversely correlated genes make sense to highlight the effect of methylation (i.e., epigenetic effect) on the expression level. For the combined dataset, six common genes have been identified which are up-regulated as well as hypo-methylated. These genes are SEMA7A, FSD1, TUBB3, GLIS1, TDO2 and SHOX2. Subsequently, nine common genes are also detected that are both down-regulated and hyper-methylated. These genes are HOXB8, SLC25A18, CALCRL, TMEM71, C1orf115, EDG1, CCDC68, NUAK1 and CMTM8. These two types of genes are important for highlighting the effect of methylation (i.e., epigenetic effect) on the gene expression. However, the supplementary materials are available at: https://www.dropbox.com/sh/gsvgty85jdlp2b6/13fOZxJV8n.

Conclusion

In this article, we have proposed a computational rule mining framework to determine special type of association rules and potential biomarkers using integrated approaches of statistical and the BiMax biclustering techniques from the gene expression as well as methylation data. At first, different statistical techniques (viz., removal of genes having low variance, normalization, normality test, different parametric and non-parametric tests) are utilized on the the whole dataset to obtain proper non-redundant/high-significant subset of differentially expressed/methylated genes. The resulting subset of genes are used in next step. Thereafter, the data is discretized and post-discretized, consecutively. The biclustering technique is then utilized to determine MFCHOIs. Thereafter, the special rules are generated from the MFCHOIs. Our proposed rule mining method performs much better than the state-of-the-art rule mining algorithms as it generates MFCHOIs instead of FIs. Therefore, it saves running time, and it can able to work on the big dataset. Pathway and GO analyses have been performed on the genes of the evolved rules by David software. Occurrence of each gene in the evolved rules of each class-label is determined for identifying the potential biomarkers. Furthermore, we have also made classification the data to know how much the evolved rules are able to describe accurately the remaining test (unknown) data. Subsequently, we have also compared the average classification accuracy, and other related factors of our proposed method with the other existing rule-based classifiers. Statistical significance tests are also utilized for checking the statistical relevance of the comparisons. Here, each of the other rule mining methods or rule-based classifiers is also starting with the same post-discretized data-matrix. At the end, we have also performed the integrated analysis of the gene expression data and the methylation data for highlighting the epigenetic effect (viz., the effect of methylation) on the gene expression level.

Supporting Information

S1 Text. Brief descriptions about the used rule-interestingness measures.

https://doi.org/10.1371/journal.pone.0119448.s001

(PDF)

S2 Text. Top 15 rules of treated/experimental and control class labels from the results of average ranking for the three real datasets.

https://doi.org/10.1371/journal.pone.0119448.s002

(PDF)

S3 Text. Comparison among our proposed rule mining method and the other existing rule mining methods for the two artificial datasets.

https://doi.org/10.1371/journal.pone.0119448.s003

(PDF)

S4 Text. Integrative analysis of Gene Expression dataset and Methylation dataset.

https://doi.org/10.1371/journal.pone.0119448.s004

(PDF)

S1 File. Significant KEGG pathways, GO:BPs, GO:CCs and GO:MFs of the genes of the evolved rules for Dataset 1.

https://doi.org/10.1371/journal.pone.0119448.s005

(XLS)

S2 File. Significant KEGG pathways, GO:BPs, GO:CCs and GO:MFs of the genes of the evolved rules for Dataset 2.

https://doi.org/10.1371/journal.pone.0119448.s006

(XLS)

S3 File. Significant KEGG pathways, GO:BPs, GO:CCs and GO:MFs of the genes of the evolved rules for Dataset 3.

https://doi.org/10.1371/journal.pone.0119448.s007

(XLS)

Acknowledgments

We thank our entire group members for helping us.

Author Contributions

Conceived and designed the experiments: UM SM AM SB. Performed the experiments: SM. Analyzed the data: SM. Contributed reagents/materials/analysis tools: SM SB. Wrote the paper: SM SB.

References

  1. 1. Bandyopadhyay S, Maulik U, Wang J. Analysis of Biological Data: A Soft Computing Approach. World Scientific, Singapore; 2007.
  2. 2. Maulik U. Analysis of gene microarray data in a soft computing framework. Applied Soft Computing 2011; 11: 4152–4160.
  3. 3. Maulik U, Bandyopadhyay S, Wang J. Computational Intelligence and Pattern Analysis in Biological Informatics. Wiley, Singapore; 2010.
  4. 4. Mallik S, Mukhopadhyay A, Maulik U, Bandyopadhyay S. Integrated analysis gene expression and genome-wide DNA methylation for tumor prediction: An association rule mining-based approach. In: Proceedings IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), IEEE Symposium Series on Computational Intelligence (SSCI), Singapore 2013.
  5. 5. Dudoit S, Yang Y, Speed T, Callow M. Statistical methods for identifying differentially expressed genes in replicated cdna microarray experiments. Statistica Sinica 2002; 12: 111–139.
  6. 6. Kim SY, Lee JW, Sohn IS. Comparison of various statistical methods for identifying differential gene expression in replicated microarray data. Stat Methods Med Res 2006; 15: 3–20. pmid:16477945
  7. 7. Pawitan Y, Michiels S, Koscielny S, Gusnanto A, Ploner A. False discovery sensitivity and sample size for microarray studies. Bioinformatics 2005; 21: 3017–3024. pmid:15840707
  8. 8. Mukhopadhyay A, Maulik U, Bandyopdhyay S. A novel biclustering approach to association rule mining for predicting hiv-1-human protein interactions. PLoS One 2012; 7: e32289. pmid:22539940
  9. 9. Creighton C, Hanash S. Mining gene expression databases for association rules. Bioinformatics 2003; 19: 79–86. pmid:12499296
  10. 10. Prelic A, Bleuler S, Zimmermann P, Wille A, Buhlmann P, Gruissem W, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 2006; 22: 1122–1129. pmid:16500941
  11. 11. Bandyopadhyay S, Mallik S, Mukhopadhyay A. A survey and comparative study of statistical tests for identifying differential expression from microarray data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2014; 11: 95–115.
  12. 12. Jarque C, Bera A. A test for normality of observations and regression residuals. Int Stat Rev 1987; 55: 163–172.
  13. 13. Fox RJ, Dimmic MW. A two-sample bayesian t-test for microarray data. BMC Bioinformatics 2006; 7: 1–11.
  14. 14. Geng L, Hamilton HJ. Interestingness measures for data mining: A survey. ACM Computing Surveys 2006; 38: 9.
  15. 15. Tan PN, Kumar V, Srivastava J. Selecting the right objective measure for association analysis. Inform Syst 2004; 29: 293–313.
  16. 16. Mishra S. The most representative composite rank ordering of multi-attribute objects by the particle swarm optimization method. Journal of Quantitative Economics 2010; 8: 165–200.
  17. 17. Pieters BFI, Knobbe A, Dzeroski S. Subgroup discovery in ranked data, with an application to gene set enrichment. In: Proceedings preference learning workshop (PL 2010) at ECML PKDD 2010; 10: 1–18.
  18. 18. Lange, MM. Comparison of estimators for respondent-driven sampling. M.Sc. Thesis, University of California, Los Angeles. 2014. Available: http://www.escholarship.org/uc/item/77n9v5bd
  19. 19. Bewick V, Cheek L, Ball J. Statistics review 9: One-way analysis of variance. Critical Care 2004; 8: 130–136. pmid:15025774
  20. 20. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on Management of data (SIGMOD’93), New York, NY, USA 1993.
  21. 21. Agrawal R, Ramakrishnan S. Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases (VLDB), Santiago, Chile 1994; 5: 914–925.
  22. 22. Zaki MJ. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering 2000; 12: 372–390.
  23. 23. Tao F. Weighted association rule mining using weighted support and significance framework. In: Proceedings of ACM SIGKDD, Washington D.C., USA 2003.
  24. 24. Pei J, Han J, Lu H, Nishio S, Tang S, Yang D, et al. H-mine: Fast and space-preserving frequent pattern mining in large databases. IIE Transactions 2007; 39: 593–605.
  25. 25. Vickers, A. Parametric versus non-parametric statistics in the analysis of randomized trials with non-normally distributed data. BMC Medical Research Methodology 2005.
  26. 26. Thadewald T, Buning H. Jarque-bera test and its competitors for testing normality. OncoTargets and Therapy 2007; 34: 87–105.
  27. 27. Smyth G. Linear models and empirical bayes methods for assessing linear models and empirical bayes methods for assessing. Statistical Applications in Genetics and Molecular Biology 2004; 3: 3.
  28. 28. Devasena C, Sumathi T, Gomathi V, Hemalatha M. Effectiveness evaluation of rule based classifiers for the classification of iris data set. Bonfring International Journal of Man Machine Interface 2011.
  29. 29. Wua Q, Qianb Y, Zhaoc X, Wang SM, Feng XJ, Chen XF, et al. Expression and prognostic significance of centromere protein A in human lung adenocarcinoma. Elsevier 2012; 77: 407–414.
  30. 30. Toh SH, Prathipati P, Motakis E, Kwoh CK, Yenamandra SP, Kuznetsov VA. A robust tool for discriminative analysis and feature selection in paired samples impacts the identification of the genes essential for reprogramming lung tissue to adenocarcinoma. BMC Genomics 2011; 12: S24. pmid:22369099
  31. 31. Landi M, Dracheva T, Rotunno M, Figueroa JD, Liu H, Dasgupta A., et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS One 2008; 3: e1651. pmid:18297132
  32. 32. Rhodes DR, Chinnaiyan AM. Integrative analysis of the cancer transcriptome. Nature Genetics 2005; 37: S31–S37. pmid:15920528
  33. 33. Wu G, Stein L. A network module-based method for identifying cancer prognostic signatures. Genome Biology 2012; 13: R112. pmid:23228031
  34. 34. Nowak R, Rein M, Heffner L, Friedman AJ, Tashjian AH Jr. Production of prolactin by smooth muscle cells cultured from human uterine fibroid tumors. J Clin Endocrinol Metab. 1993; 76: 1308–1313. pmid:8496322
  35. 35. Wan Q, Zheng A, Liu X, Chen Y, Han L. Expression of transient receptor potential channel 6 in cervical cancer. OncoTargets and Therapy 2012; 5: 171–176. pmid:22973112
  36. 36. Rainho C, Pontes A, Rogatto S. Expression and imprinting of insulin-like growth factor ii (igf2) and h19 genes in uterine leiomyomas. Gynecologic Oncology 1999; 74: 375–380. pmid:10479496
  37. 37. Navarro A, Yin P, Monsivais D, Lin SM, Du P, Wei JJ, et al. Genome-wide DNA methylation indicates silencing of tumor suppressor genes in uterine leiomyoma. PLoS One 2012; 7: e33284. pmid:22428009
  38. 38. Baba AL, Catoi C. Comparative Oncology. Bucharest: The Publishing House of the Romanian Academy, 2007.