A fuzzy co-clustering algorithm for biomedical data

Fuzzy co-clustering extends co-clustering by assigning membership functions to both the objects and the features, and helps improve the clustering accuracy of biomedical data. In this paper, we introduce a new fuzzy co-clustering algorithm based on the information bottleneck, named ibFCC. The ibFCC formulates an objective function that includes a distance function employing information bottleneck theory to measure the distance between a feature data point and the feature cluster centroid. Experiments were conducted on five biomedical datasets, and the ibFCC was compared with such prominent fuzzy (co-)clustering algorithms as FCM, FCCM, RFCC and FCCI. Experimental results showed that ibFCC could yield high quality clusters and was better than all these methods in terms of accuracy.


Introduction
Nowadays, the amount of biomedical data grows rapidly, which makes it difficult for medical workers and patients to find the information they need. Clustering can identify the latent structure and knowledge behind large-scale biomedical data, and therefore plays an important role in reorganizing biomedical data and helping users find relevant information. This technique tries to generate a set of clusters where intra-cluster similarity is maximized and inter-cluster similarity is minimized, and is widely used for such applications as automatic categorization of text, grouping gene expression data, and others [1,2].
In recent years many researchers have studied data mining and presented a number of clustering algorithms [3][4][5][6][7]. These algorithms can be divided into hard and soft clustering algorithms [8]. Hard clustering has been studied extensively and is well accepted by the scientific community. For example, Chen et al. [9] studied hard clustering and proposed an automated two-level variable weighting clustering algorithm for multiview data, which can simultaneously compute weights for views and individual variables. In hard clustering, each object belongs to exactly one cluster, while soft clustering allows an object to belong to more than one cluster. For example, nodular goiter can be put into two clusters, Thyroid Surgery and Endocrinology. As another example, atypical hyperplasia could be considered as normal endometrium or abnormal endometrium by different doctors. These examples suggest that soft clustering may be more reasonable than hard clustering, because an object often cannot be put into just one cluster.
When mentioning soft clustering, we need to talk about fuzzy clustering, which is regarded as the combination of clustering and fuzzy sets. Fuzzy clustering is relatively new. Its representative algorithm is the Fuzzy c-Means (FCM) algorithm, the fuzzy version of the traditional K-Means clustering algorithm. The main difference is that K-Means is a hard algorithm, while FCM is a soft algorithm. In other words, K-Means represents the affiliation of objects to clusters by memberships taking the values 0 and 1, whereas in FCM the memberships take values in the real unit interval [0, 1] [10,11]. Conversely, K-Means can be regarded as a special case of FCM. Researchers have continued to develop FCM in recent years. For example, Jiang et al. [12] studied how to combine the clustering result from each view and proposed a collaborative fuzzy c-means (Co-FCM) algorithm.
The FCM is a one-dimensional clustering algorithm. That is to say, when grouping a disease-symptom contingency table, the FCM assumes that there is no relationship between the symptoms, and just classifies the diseases based on the symptoms. In practice, mutual influences may exist; for example, there is a close relation between increased pulse pressure and several types of metabolic diseases. It is therefore unsound to neglect the correlations between the symptoms. If the disease-symptom contingency table is considered unrepresentative, we can discuss a more typical example, i.e. a document-word matrix. In exactly the same way, when we analyze a document-word matrix, we should take the correlations between words into account, because some words are synonyms and some are antonyms. Thus, when we analyze an object-feature contingency table for clustering, we should group both the object and the feature dimensions. Accordingly, two-dimensional fuzzy clustering algorithms, called fuzzy co-clustering algorithms, are better suited than the one-dimensional FCM, especially when there are strong correlations between features.
Fuzzy co-clustering can simultaneously group objects and features based on the co-occurrence information [13][14][15]. As a result, more relationships between objects and features are kept, and therefore we can get more interpretable clustering results. At the same time, because the features are also partitioned into feature clusters, the feature dimensionality is reduced significantly and the clustering process is accelerated. So far, many fuzzy co-clustering algorithms have been presented. The FCCM (Fuzzy Clustering for Categorical Multivariate data) [14] is the best-known fuzzy co-clustering algorithm and can be regarded as a two-dimensional FCM. Other prominent fuzzy co-clustering algorithms include FCR (Fuzzy co-Clustering with Ruspini's condition) [16], FCCI (Fuzzy Co-Clustering algorithm for Images) [17], PFCC (Possibilistic Fuzzy Co-Clustering) [18], RFCC (Robust Fuzzy Co-Clustering) [19] and SS-HFCR (Heuristic Semi-Supervised Fuzzy co-Clustering algorithm) [20]. In order to compare these algorithms, we first explain the mathematical notations used in this paper (see Table 1). Using these notations, the objective functions of some popular fuzzy co-clustering algorithms mentioned above are provided in Table 2.
The FCCI algorithm is one of the most important fuzzy co-clustering algorithms. This algorithm includes a multi-dimensional distance function as the dissimilarity measure and entropy as the regularization term in its objective function. The FCCI emphasizes the importance of the distance function, and its distance function equals the square of the Euclidean distance between a feature data point and the feature cluster centroid. However, many other similarity measures are available in the fields of data mining and pattern recognition [21]. Our previous work and that of other researchers indicate that the information bottleneck based similarity measure is a more desirable choice, because it achieved higher clustering accuracy than other measures [22][23][24]. In the work of S. Noam and T. Naftali [23], the experimental results showed that the average performance over all datasets attained 0.55 accuracy, while the second best result was 0.47 accuracy. Ye et al. [25] presented a novel alternative clustering algorithm, named SmIB, which employed mutual information to measure the information residing in data, and experimental results demonstrated that the SmIB algorithm was superior to the existing state-of-the-art alternative clustering algorithms.
The above analysis motivates us to present a novel Fuzzy Co-Clustering algorithm based on the information bottleneck similarity measure, called ibFCC. This approach assigns membership functions to both the objects and the features. Besides, because biomedical data comes in a variety of forms, it is difficult to select a single appropriate method for calculating the pairwise object similarity, and we consider the information bottleneck based similarity measure a more appropriate choice. In the ibFCC, an objective function is formulated, which includes a distance function that employs information bottleneck theory to measure the similarity between a feature data point and the feature cluster centroid.
The remainder of this paper is organized as follows. We first introduce the ibFCC in detail, and then present our experimental results on five datasets, Ohsumed [26], Lung Cancer [27], Breast Tissue [28], Cardiotocography [28] and Mice Protein Expression [28]. Finally, we conclude our work.

The ibFCC algorithm
Since a distance function is necessary for fuzzy co-clustering to create richer co-clusters [17], FCCI includes the Euclidean distance of feature data points from the feature cluster centroids in the co-clustering process. However, there are so many other distance measures besides the Euclidean distance that it is difficult for users to choose an appropriate one, and too often the choice is arbitrary. In the study of clustering, the information bottleneck based distance measure has proved much better. Therefore, the ibFCC algorithm we propose employs information bottleneck theory to measure the distance between feature data points and the feature cluster centroids. The overall clustering process is illustrated in Fig 1. The goal of ibFCC is to minimize the objective function in Eq 1, subject to the constraints in Eqs 2 and 3.
Table 2. Comparison of some popular fuzzy co-clustering algorithms (columns: Name, Objective function, Description). Description entries: $u^{pos}_{ci}$ is the document possibilistic membership, $w_{cj}$ is the word partitioning membership, and $T_w$ is a user-defined parameter; $x_{ci}$ is a new additional and robust type of object membership, and $T_x$ is a user-defined parameter.

The first term in Eq 1 is the degree of aggregation that should be minimized during co-clustering, which is intended to enable highly related objects and features to be co-clustered together. $u_{ci}$ and $v_{cj}$ are two membership functions, indicating the memberships of documents and features, respectively. The second and third terms are entropy regularization factors that combine all $u_{ci}$'s and $v_{cj}$'s separately. They control the degree of fuzziness in the final clusters, where $T_u$ and $T_v$ are weighting parameters. The constrained optimization of ibFCC can be solved by applying the Lagrange multipliers α and β to the constraints in Eqs 2 and 3, respectively.
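To make the three terms concrete, the following minimal sketch evaluates an objective with the structure described above: an aggregation term over memberships and distances plus two entropy regularizers. It assumes the FCCI-style form, with U of shape (C, N), V of shape (C, K) and a distance tensor D of shape (C, N, K); the function name, the array layout and the small constant `eps` are illustrative assumptions rather than the paper's exact Eq 1.

```python
import numpy as np

def ibfcc_objective(U, V, D, T_u, T_v, eps=1e-12):
    """Evaluate an FCCI-style objective (assumed form, not the paper's exact Eq 1).

    U: (C, N) object memberships, assumed to sum to 1 over clusters for each object.
    V: (C, K) feature memberships, assumed to sum to 1 over features for each cluster.
    D: (C, N, K) distances d_cij between feature data points and cluster centroids.
    """
    # Aggregation term: sum over c, i, j of u_ci * v_cj * d_cij
    aggregation = np.einsum("ci,cj,cij->", U, V, D)
    # Entropy regularizers weighted by T_u and T_v (eps guards against log(0))
    entropy_u = T_u * np.sum(U * np.log(U + eps))
    entropy_v = T_v * np.sum(V * np.log(V + eps))
    return float(aggregation + entropy_u + entropy_v)
```

Larger values of `T_u` and `T_v` give the (negative) entropy terms more weight, which in this form favors fuzzier memberships.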
Taking the partial derivatives of $J'_{ibFCC}$ in Eq 4 with respect to U and V and setting the gradient to zero yields Eqs 5 and 6. Solving these equations gives the formulae for $u_{ci}$ and $v_{cj}$ in Eqs 7 and 8, which are the update equations for the document and feature memberships, where $d_{cij}$ is the distance between a feature data point and the feature cluster centroid.
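As an illustration of what such update equations look like, the sketch below applies the closed-form updates that follow from an entropy-regularized objective of the FCCI type under sum-to-one constraints: each membership is a normalized exponential of a negatively weighted distance. This is an assumed form, since Eqs 7 and 8 are not reproduced here; the function name and array shapes are hypothetical.

```python
import numpy as np

def update_memberships(V, D, T_u, T_v):
    """One pass of entropy-regularized membership updates (assumed FCCI-style form).

    V: (C, K) current feature memberships.
    D: (C, N, K) distances d_cij.
    Returns U of shape (C, N) and the updated V of shape (C, K).
    """
    # u_ci is proportional to exp(-sum_j v_cj d_cij / T_u), normalized over clusters c
    a = -np.einsum("cj,cij->ci", V, D) / T_u
    a -= a.max(axis=0, keepdims=True)          # shift for numerical stability
    U = np.exp(a)
    U /= U.sum(axis=0, keepdims=True)

    # v_cj is proportional to exp(-sum_i u_ci d_cij / T_v), normalized over features j
    b = -np.einsum("ci,cij->cj", U, D) / T_v
    b -= b.max(axis=1, keepdims=True)
    V_new = np.exp(b)
    V_new /= V_new.sum(axis=1, keepdims=True)
    return U, V_new
```

Subtracting the maximum before exponentiating does not change the normalized result and only avoids overflow when distances are large relative to the weighting parameters.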
Let $c_1$ and $c_2$ be two clusters. The distance between $c_1$ and $c_2$ is measured by the information loss due to the merging of $c_1$ and $c_2$ (Eq 9), that is, the difference between $I(C_{before}, Y)$ and $I(C_{after}, Y)$, the mutual information before and after the two clusters $c_1$ and $c_2$ are merged. Here $C_{before}$ and $C_{after}$ are the clusters before and after the merger, Y is the feature space, and y is one feature.
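The information-loss distance can be illustrated with a small sketch that computes the mutual information of a joint cluster-feature distribution before and after two rows (clusters) are merged. The representation of the clustering as a joint probability table and the function names are assumptions made for illustration; the paper's Eq 9 may be written in an equivalent expanded form.

```python
import numpy as np

def mutual_information(joint):
    """I(C, Y) in nats for a joint distribution with rows = clusters, columns = features."""
    joint = np.asarray(joint, dtype=float)
    p_c = joint.sum(axis=1, keepdims=True)   # marginal over clusters
    p_y = joint.sum(axis=0, keepdims=True)   # marginal over features
    mask = joint > 0                         # 0 * log(0) is treated as 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (p_c @ p_y)[mask])))

def merge_loss(joint, a, b):
    """Information loss I(C_before, Y) - I(C_after, Y) when clusters a and b are merged."""
    joint = np.asarray(joint, dtype=float)
    merged = np.delete(joint, b, axis=0)     # copy without row b
    a_new = a if a < b else a - 1            # position of row a after the deletion
    merged[a_new] = joint[a] + joint[b]
    return mutual_information(joint) - mutual_information(merged)
```

Merging two clusters with identical conditional feature distributions loses no information, so their distance is zero; the more the distributions differ, the larger the loss.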
Let the i-th document be a singleton cluster $sc_i$, let $x_{ij}$ denote the j-th feature value of the i-th document, and let $P = \{p_{cj}\}$ be the set of feature cluster centroids. Eq 9 can then be rewritten to calculate the distance between the cluster $sc_i$ and the c-th cluster (Eq 10), where $|sc_i| = 1$ because this cluster has only one object. $d_{cij}$ is the j-th component of $d(sc_i, c)$ (Eq 11), where $t_{cij} = (x_{ij} + |c| \cdot p_{cj})/(1 + |c|)$ and $|c|$ is the number of documents in the c-th cluster. Defining the value of $|c|$ is a little more complicated in fuzzy clustering than in hard clustering, because we need to perform a defuzzification operation on the fuzzy membership matrix. After defuzzification, we can get the value of $|c|$ as easily as in hard clustering. Note that in ibFCC it is difficult to get the value of $p_{cj}$ explicitly. Even if $p_{cj}$ could in theory be derived in the same way as $u_{ci}$ and $v_{cj}$, the derivation would suffer from high computational complexity. We therefore choose an alternative approach that employs a weighted averaging method. In fuzzy clustering, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster, which gives the normalized update equation of $p_{cj}$ (Eq 12). Through Eqs 7 and 8, the solution of the constrained optimization problem in Eq 4 can be approximated by Picard iteration. The proof of convergence of the ibFCC algorithm is given in the Appendix section of this paper. The pseudocode of ibFCC is given in Algorithm 1.
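The weighted-average centroid, the defuzzified cluster size and the merged value $t_{cij}$ can be sketched as follows. The sketch assumes a membership-weighted mean for the centroid and maximum-membership defuzzification for $|c|$; any further normalization applied in Eq 12 is not shown, and the function names are hypothetical.

```python
import numpy as np

def update_centroids(U, X, eps=1e-12):
    """Membership-weighted mean: p_cj = sum_i u_ci x_ij / sum_i u_ci (assumed form)."""
    weights = U.sum(axis=1, keepdims=True)        # (C, 1) total membership per cluster
    return (U @ X) / (weights + eps)              # (C, K)

def cluster_sizes(U):
    """Defuzzify by assigning each object to its highest-membership cluster, then count."""
    labels = U.argmax(axis=0)
    return np.bincount(labels, minlength=U.shape[0])

def merged_values(X, P, sizes):
    """t_cij = (x_ij + |c| * p_cj) / (1 + |c|), returned with shape (C, N, K)."""
    s = np.asarray(sizes, dtype=float)[:, None, None]
    return (X[None, :, :] + s * P[:, None, :]) / (1.0 + s)
```

In an iterative scheme of the kind described above, these quantities would be recomputed after each membership update so that the distances used in the next iteration reflect the current clusters.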

Algorithm effectiveness tests
In order to test the effectiveness of ibFCC, we carried out a set of experiments. The experimental results are also compared with four well received approaches in the literature, FCM, FCCM, RFCC and FCCI. Of the four algorithms, FCM is a standard fuzzy clustering algorithm, and the others are fuzzy co-clustering algorithms.
Experimental setup. We employed five datasets to evaluate the performance of ibFCC in categorizing real-world data, Ohsumed, Lung Cancer, Breast Tissue, Cardiotocography and Mice Protein Expression.
1) The Ohsumed corpus is a collection consisting of the first 20,000 documents from the 50,216 medical abstracts of the year 1991. The classification scheme consists of the 23 Medical Subject Headings (MeSH) disease categories. Based on the Ohsumed corpus, we constructed two subsets, Oh1 and Oh2, which are introduced in Table 3. In our experiments on the Ohsumed corpora, we selected the top 500 features, that is, K = 500.
2) The Lung Cancer (LC) dataset is used by Hong and Young to illustrate the power of the optimal discriminant plane even in ill-posed settings. It contains 27 instances and 56 attributes. We used the existing classification as our baseline on how the dataset should be clustered.
3) The Breast Tissue (BT) corpus can be used for predicting the classification of either the original 6 classes or of 4 classes by merging together the fibro-adenoma, mastopathy and glandular classes whose discrimination is not important (they cannot be accurately discriminated anyway). It contains 106 instances and 9 attributes.
4) In the Cardiotocography (Card) dataset, 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features were measured. The CTGs were also classified by three expert obstetricians with a consensus classification label assigned to each of them.
5) The Mice Protein Expression (MPE) dataset contains a total of 1076 measurements per protein. Each measurement can be considered as an independent sample/mouse. The eight classes of mice are described based on features such as genotype, behavior and treatment.
Evaluation criteria. There are several ways of numerically scoring cluster quality, such as Entropy, F-Measure and Overall Similarity. We choose F-Measure, Entropy and p-value as the criteria to evaluate the performance of ibFCC.
F-Measure is the weighted harmonic mean of precision and recall. In terms of evaluating clustering accuracy, the higher the F-Measure value, the better the clustering quality. The F-Measure value of class i with respect to cluster j is given by:

$$F(i,j) = \frac{2 \cdot precision(i,j) \cdot recall(i,j)}{precision(i,j) + recall(i,j)} \qquad (13)$$

where precision(i,j) and recall(i,j) are computed as

$$precision(i,j) = \frac{n_{ij}}{n_j}, \qquad recall(i,j) = \frac{n_{ij}}{n_i}$$

where $n_{ij}$ is the number of members of class i in cluster j, $n_j$ is the number of members of cluster j, and $n_i$ is the number of members of class i. The overall value of the F-Measure is given by:

$$F = \sum_i \frac{n_i}{n} \max_j F(i,j)$$

where n is the total number of documents.

Entropy, a concept from information theory, can also be used to evaluate the cluster distribution. The Entropy of a clustering result is:

$$E_{cs} = \sum_{j=1}^{m} \frac{n_j}{n} E_j$$

where $E_{cs}$ is the overall Entropy value, $n_j$ is the number of documents in cluster j, n is the number of all the documents, m is the number of clusters and $E_j$ is the Entropy value of cluster j, which is calculated as:

$$E_j = -\sum_i p_{ij} \log p_{ij}$$

where $p_{ij}$ is the probability that a document belonging to class i is put into cluster j during the partition. The lower the value of Entropy, the higher the clustering quality.

In research on GO (Gene Ontology), whose objective is to provide controlled vocabularies for the description of the biological process, molecular function, and cellular component of gene products, the p-value is often used to calculate the statistical significance of a group of proteins that shares a GO term [29]. Given a dataset of N proteins of which M have the same annotation, the probability of observing m or more proteins that are annotated with the same GO term out of n proteins is:

$$p = \sum_{k=m}^{n} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$

A cluster with a smaller p-value is usually more significant than one with a higher p-value.
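The F-Measure and Entropy criteria described above can be sketched directly from a class-by-cluster contingency table. The sketch assumes the standard definitions in terms of $n_{ij}$, $n_i$ and $n_j$, takes $p_{ij}$ as the fraction of cluster j that belongs to class i, and uses the natural logarithm; the function names are hypothetical.

```python
import numpy as np

def f_measure(counts):
    """Overall F-Measure from counts[i, j] = n_ij (members of class i in cluster j)."""
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1, keepdims=True)   # class sizes
    n_j = counts.sum(axis=0, keepdims=True)   # cluster sizes
    with np.errstate(divide="ignore", invalid="ignore"):
        precision = np.where(n_j > 0, counts / n_j, 0.0)
        recall = np.where(n_i > 0, counts / n_i, 0.0)
        denom = precision + recall
        F = np.where(denom > 0, 2 * precision * recall / denom, 0.0)
    # Weight the best-matching cluster of each class by the class share n_i / n
    return float(np.sum(n_i[:, 0] / counts.sum() * F.max(axis=1)))

def clustering_entropy(counts):
    """Size-weighted average of per-cluster entropies; empty clusters contribute nothing."""
    counts = np.asarray(counts, dtype=float)
    n_j = counts.sum(axis=0)
    n = counts.sum()
    total = 0.0
    for j in range(counts.shape[1]):
        if n_j[j] == 0:
            continue
        p = counts[:, j] / n_j[j]
        p = p[p > 0]                          # 0 * log(0) is treated as 0
        total += n_j[j] / n * float(-np.sum(p * np.log(p)))
    return total
```

A clustering that reproduces the classes exactly gives an F-Measure of 1 and an Entropy of 0, which matches the reading that higher F-Measure and lower Entropy indicate better quality.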
After obtaining the p-value of each cluster, the quality of the overall clustering can be measured by the CS (clustering score) function, which is computed from the following quantities: ns and nl are the numbers of significant and insignificant clusters, respectively; the cutoff denotes the α level (0.05), and a group of proteins associated with a p-value less than the cutoff is considered significant, and insignificant otherwise; min($p_i$) is the smallest p-value of the significant cluster i.
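The p-value and clustering score can be sketched as below. The p-value follows the hypergeometric tail described above. The clustering score formula is an assumption: it averages the smallest p-values of the significant clusters together with the cutoff for each insignificant cluster, which is one commonly used form consistent with the quantities defined here, but it is not confirmed to be the paper's exact expression. Function names are hypothetical.

```python
from math import comb

def go_p_value(N, M, n, m):
    """Probability of observing m or more annotated proteins among n drawn from N,
    when M of the N proteins carry the annotation (hypergeometric tail)."""
    upper = min(n, M)
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, upper + 1))
    return tail / comb(N, n)

def clustering_score(min_p_values, cutoff=0.05):
    """Assumed CS form: (sum of min p-values of significant clusters
    + number of insignificant clusters * cutoff) / total number of clusters."""
    if not min_p_values:
        raise ValueError("at least one cluster is required")
    significant = [p for p in min_p_values if p < cutoff]
    n_s = len(significant)
    n_l = len(min_p_values) - n_s
    return (sum(significant) + n_l * cutoff) / (n_s + n_l)
```

Under this assumed form a lower score is better, since each significant cluster contributes less than the cutoff while each insignificant cluster contributes exactly the cutoff.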

Results
We first compared the performance of FCM, FCCM, RFCC, FCCI and ibFCC on the six subsets. All five algorithms were initialized randomly and run ten times to reduce the impact of local optima. [...] .68 in terms of Entropy, whose performance is relatively better than FCM, FCCM and RFCC. At the same time, we observed that the F-Measure values of these algorithms are higher, and the Entropy values are lower, when the value of C is small. As the value of C increases, the F-Measure and Entropy values show that the performance of these clustering algorithms degrades; however, the clustering accuracy of ibFCC remains the highest.
In addition to F-Measure and Entropy, we chose clustering score and p-value to further evaluate the performance of ibFCC. The experimental results in terms of clustering score are shown in Fig 3, which compares the five algorithms. On the six subsets, the clustering score values of ibFCC are much lower, and thus this algorithm achieves a significant improvement over its counterparts. However, on the BT and MPE datasets, the clustering score value of ibFCC is only slightly lower than that of FCCI, which shows that the clustering accuracy of these two algorithms is similar. To be sure, the experimental results illustrated in [...] In addition, in the clustering results of FCCI and FCM, there are often some empty clusters, which can easily yield a higher apparent clustering accuracy (higher F-Measure value and lower Entropy value) because the effective number of clusters C is lower.
The following experiments illustrate the significance of our clustering results in terms of p-value. Experimental results on the six subsets are shown in Fig 4A, 4B, 4C, 4D, 4E and 4F, respectively. In Fig 4A, the p-values of the best clusters of the five algorithms are 8.0E-09, 0.045, 1.1E-10, 0.031 and 3.6E-30, respectively. Similarly, on the other subsets our algorithm attains the lowest p-value, or approaches it (only on the Card subset in Fig 4E). The results of this set of experiments show that biomedical data can be grouped into more meaningful clusters, and that our algorithm can provide more significant clusters.
The corresponding document cluster distributions are shown in Fig 5. The clustering results on LC, Oh1 and BT are illustrated in Fig 5A, 5B and 5C. Because the number of clusters is large on the Oh2, Card and MPE datasets, their clustering results are difficult to illustrate in figures, and the experiments on these three datasets are not discussed here. It can be seen from Fig 5 that ibFCC generates better clusters than the other algorithms: it recovers C1 well on LC, C3 and C4 on Oh1, and C1, C3 and C4 on BT. The clustering performances of FCCM and RFCC are similar, and it is difficult for these two algorithms to capture the categories properly. FCM and FCCI perform well on some of the datasets, such as the LC subset.

Discussion
In our experiments, some clusters have few documents, such as some clusters generated by FCCI. Analysis of this problem indicates that when datasets are sparse and high-dimensional, all the objects can be assigned to a single cluster in FCM-type clustering [30]. The six subsets are indeed sparse, and thus in the clustering results of such fuzzy co-clustering algorithms as FCCM, RFCC and FCCI, some clusters have no objects (see Fig 5), which significantly reduces clustering performance. To avoid the problem, Mei et al. [30] proposed a method that normalizes all the centroids to unit norm after each iteration, where $δ_c$ is the centroid of the c-th cluster, m is a constant, $w_i$ controls the weights of objects, and $δ'_c$ is the normalized centroid. Their algorithm is an incremental clustering method, and thus does not appear in our experiments.
In ibFCC, centroids and objects are assigned different weights in calculating the information bottleneck based similarity, as in Eq 11, which is equivalent to the normalization process of Mei et al. Therefore, in the experimental results of ibFCC, there are fewer empty clusters, and clustering performance is much better. [...] zero empty clusters in the results, and therefore, this algorithm outperforms the counterparts. The FCCI algorithm has the second best clustering results, with only 0.1, 0.5 and 0.1 empty clusters on the Oh1, Oh2 and LC subsets, respectively.
In addition to the number of empty result clusters, running time is also an important issue. As indicated earlier, the time complexity of ibFCC is O(CNKτ), which is the same as that of such fuzzy co-clustering algorithms as FCCM, RFCC and FCCI. Although the FCM algorithm implements fuzzy clustering rather than fuzzy co-clustering, its time complexity is also O(CNKτ). However, time complexity only reflects the theoretical analysis. In order to compare these algorithms thoroughly, we carried out additional experiments to record clustering time. The running time required by each algorithm to complete a single clustering run on each dataset is listed in Table 4. The comparison indicates that, on the six datasets, the FCM algorithm is the most time-consuming. The main reason is that this algorithm is sensitive to noise, which significantly reduces the convergence speed. Although the other four fuzzy co-clustering algorithms seem to be more complex, they group features as well as objects, which helps to significantly reduce the feature dimension and improve clustering efficiency. This again supports the view that fuzzy co-clustering algorithms are better than fuzzy clustering algorithms. The comparison of the four fuzzy co-clustering algorithms in terms of running time shows that the FCCM is the fastest and the ibFCC takes the longest. The former is because the computational procedure of the FCCM is very simple, and the latter is because of the complex information bottleneck based similarity measure of ibFCC. The similarity measure of FCCI is more complex than those of FCCM and RFCC, so FCCI needs more time to complete clustering. Similarly, the information bottleneck based measure of ibFCC is more time-consuming than that of FCCI, and therefore the running time of ibFCC is longer than that of FCCI. Even so, the ibFCC is still more efficient than FCM. In conclusion, the ibFCC achieves high clustering accuracy at the cost of a longer actual running time caused by the calculation of the similarity measure, although its theoretical time complexity does not increase. Improving the actual running efficiency will therefore be a focus of our further research.

Conclusion
Recently, several fuzzy co-clustering algorithms have been proposed. Keeping the advantages of co-clustering and fuzzy clustering, these algorithms improve the representation of overlapping clusters by using fuzzy membership functions, and greatly facilitate the reorganization of large biomedical data.
In existing prominent fuzzy co-clustering algorithms, the Euclidean distance function is the most frequently used. However, the information bottleneck based distance measure has proved much better in many clustering algorithms. Therefore, in this paper we propose a novel fuzzy co-clustering algorithm, named ibFCC, whose objective function includes an information bottleneck based distance function to measure the distance between feature data points and the feature cluster centroids. We conducted experiments on five biomedical datasets, Ohsumed, Lung Cancer, Breast Tissue, Cardiotocography and Mice Protein Expression, to evaluate the performance of ibFCC. Our algorithm was also compared with some popular fuzzy (co-)clustering algorithms and outperformed them.
Determining the number of clusters is a well-known challenge in the literature. In our study, the value of C is still specified manually by users, which means that ibFCC is not fully unsupervised. In the future, we intend to incorporate techniques for estimating the number of clusters to optimize our approach.
Similarly, the variables $v_{cj}$ and $d_{cij}$ may be considered as two constants. Theorem 1 can then be proven by showing, via the Lagrange multiplier method, that u* (i.e., the updated value of $u_{ci}$ given by Eq 7) is a local minimum of the objective function J(U). For this we need to prove that the Hessian matrix ∇²J(u

Corollary 1
The ibFCC algorithm converges to a local minimum of the optimization, with the update formulae given in Eqs 12, 11, 8 and 7.
Proof. This corollary is a direct consequence of the above three theorems. Theorems 1 and 2 indicate that the procedure of membership updating never increases the value of the ibFCC objective function. Theorem 3 states that there is a limit to how much this objective function can be decreased. So eventually the procedure should stop somewhere before or when it reaches this limit.