Identifying Subspace Gene Clusters from Microarray Data Using Low-Rank Representation

Identifying subspace gene clusters from the gene expression data is useful for discovering novel functional gene interactions. In this paper, we propose to use low-rank representation (LRR) to identify the subspace gene clusters from microarray data. LRR seeks the lowest-rank representation among all the candidates that can represent the genes as linear combinations of the bases in the dataset. The clusters can be extracted based on the block diagonal representation matrix obtained using LRR, and they can well capture the intrinsic patterns of genes with similar functions. Meanwhile, the parameter of LRR can balance the effect of noise so that the method is capable of extracting useful information from the data with high level of background noise. Compared with traditional methods, our approach can identify genes with similar functions yet without similar expression profiles. Also, it could assign one gene into different clusters. Moreover, our method is robust to the noise and can identify more biologically relevant gene clusters. When applied to three public datasets, the results show that the LRR based method is superior to existing methods for identifying subspace gene clusters.


Introduction
With the advent of the DNA microarray technology, it is now possible to study the transcriptional response of a complete genome to different experimental conditions. However, reconstruction the regulatory networks from the high throughput DNA microarray data is one of the foremost challenges of current bioinformatics research. Fortunately, many studies have unveiled that regulatory networks are modular and hierarchically organized [1][2][3][4][5][6][7]. Since subspace gene clusters may represent co-regulated genes to some degree, so we can reconstruct the whole regulatory network start with identifying the clusters. In addition, since genes can be clustered with similar cellular functions, therefore, identifying the clusters from DNA microarray data might provide much deeper insight into biological function and relevance.
Traditional clustering methods, such as hierarchical clustering [8], K-means clustering [9], self-organizing maps [10], and modelbased methods [11][12][13][14] can organize gene expression data into clusters of genes possessing similar expression profiles using all the conditions, and the clusters are exclusive and exhaustive (Figure 1  (A)). These methods identify gene clusters by assuming that genes with similar expression profiles share similar functions or the same pathway. Although these clustering methods classify genes successfully when applied to relatively small data sets, when used for analyzing large-scale expression data, it is limited by three wellrecognized drawbacks [3]. Firstly, commonly used algorithms assign each gene to a single cluster, whereas in fact many genes may participate in several functions and should thus be included in several clusters [3,10]. Secondly, these algorithms group genes on the basis of their expression under all experimental conditions, whereas cellular processes are generally affected only by a small subset of these conditions, so that a gene can participate in multiple clusters or in none at all [15]. In the analysis of a particular cellular process, therefore, most conditions do not contribute information but instead increase the amount of background noise [3]. Thirdly, due to the complex procedures of microarray experiments, gene expression data often contains a huge amount of noise. These algorithms force each gene into a cluster, which may cause the algorithm to be sensitive to noise [15][16][17].
Recently, subspace clustering methods have been proposed to find subgroups of genes that exhibit similar behavior across subsets of samples, experimental conditions, or time points [15,18]. Subspace clustering was first proposed by Agrawal et al. in general data mining domain [19] to find subsets of objects such that the objects appear as a cluster in a subspace formed by a subset of features. Figure 1 (B) shows an example of the subspace clusters (a and b) embedded in a gene expression matrix. From Figure 1 (B), we can found that a subspace cluster is defined as a submatrix spanned by a set of genes and a set of samples. Genes or samples can be part of more than one subspace cluster or of no subspace cluster. In addition, the subsets of samples for various subspace clusters can be different. Two subspace clusters can share some common genes and samples, and some genes may not belong to any subspace cluster [15]. Therefore, the goal of subspace clustering technique is to find a set of significant subspace clusters in a matrix. Until now, subspace clustering approaches have found widespread applications in many fields [20], e.g., pattern recognition, data compression, image processing and bioinformatics. These methods include algebraic methods [20], such as Generalized Principal Component Analysis(GPCA) [21], iterative methods [20], statistical methods [20], and spectral clusteringbased methods [20].
GPCA presents an algebraic way to model the data drawn from a union of multiple subspaces. The intersections between subspaces are automatically allowed; hence, GPCA can deal with both independent and dependent subspaces. However, this method is sensitive to noise and outliers due to the difficulty of estimating the polynomials from real data, which also causes the high computation cost of GPCA [20].
Algebraic algorithms are sensitive to the noise. A very simple way of improving the performance of algebraic algorithms in the case of noisy data is to use iterative refinement. The main advantage of algebraic method is its simplicity since it alternates between assigning points to subspaces and estimating the subspaces via PCA. While the algorithm's convergence to the global optimum depends on a good initialization, and the method is sensitive to outliers.
Statistical methods include Mixture of Probabilistic PCA (MPPCA) [22], Agglomerative Lossy Compression (ALC) [23] and Random Sample Consensus (RANSAC) [24]. MPPCA is a simple and intuitive method, where each iteration can be computed in closed form by using PPCA. An important drawback of MMPCA is that the number and dimensions of the subspaces need to be known beforehand. In principle, ALC does not need to know the number of subspaces and their dimensions. In practice, however, the number of subspaces is directly related to the parameter of ALC. When the number of subspaces is known, one can run ALC for several values of its parameter. This typically increases the computational complexity of the method. Another disadvantage of ALC, perhaps the major one, is that there is no theoretical proof for the optimality of the agglomerative procedure. The main advantage of RANSAC is its ability to handle outliers explicitly. An important drawback of RANSAC is that its performance deteriorates quickly as the number of subspaces increase. Another critical drawback of RANSAC is that it requires the dimension of the subspaces to be known and equal.
Spectral clustering algorithms are very popular techniques for clustering high-dimensional data. One of the main challenges in applying spectral clustering to the subspace clustering problem is to define a good affinity matrix. This is because two points could be very close to each other but lie in different subspaces (e.g., near the intersection of two subspaces). Conversely, two points could be far from each other but lie in the same subspace. One of the most popular spectral clustering algorithms is sparse subspace clustering (SSC) [25,26]. SSC is robust to noise and its computational complexity dose not grow exponentially with the number of subspaces and their dimensions. Nonetheless, it requires solving N optimization problems in O(N) variables, hence, it may be time consuming. Another possible disadvantage of SSC is that it is provably correct only in the case of independent or disjoint subspaces.
In this paper, we propose to use Low-Rank Representation(LRR) to identify the subspace gene clusters from microarray data. Compared with traditional techniques, such as K-means clustering, our method can cluster genes with similar functions but without similar expression profiles. Moreover, LRR is better fitted to analyze the microarray data than other subspace clustering algorithms, such as GPCA, since it is robust to noise and outliers via lowest-rank criterion, and it also can be capable of extracting useful information from a high level of background noise. The contribution of the paper is that a new subspace gene clustering method based on LRR is proposed, and its theoretical analysis is also given.

Low-Rank Representation
Before we present the Low-Rank Representation(LRR) based method for identifying gene clusters from microarray data, we first introduce the algorithm of Low-Rank Representation, which is a new framework for seeking the lowest rank representation matrix [27]. Supposing that there is a gene expression dataset with p genes and n samples, we can denote it as a matrix D with size p|n. When the data is noiseless, the LRR algorithm looks for a representation Z by solving the problem We call the optimal solutions Z Ã of the above problem the ''lowest-rank representation'' of the data D with respect to a dictionary A. The above optimization problem is difficult to solve due to the discrete nature of the rank function. Fortunately, as suggested by matrix completion methods [28,29], the following convex optimization provides a good surrogate for problem(1): where : k k Ã denotes the nuclear norm of a matrix [27,30], i.e., the sum of the singular values of the matrix. Note that the block diagonal structure of Z directly induces clustering genes (each block corresponds to a cluster). So the clustering task is equivalent to finding a block diagonal representation matrix Z.
However, due to the complex procedures of microarray experiments, gene expression data often contains a huge amount of noise. Therefore, the optimization model of LRR is formulated as: where E k k 2,1~P  errors E. Minimizing the l 2,1 -norm of noise is to meet the assumption that some data vectors are corrupted and others are clean. Since in this case, the solution Z Ã to Eq.(3) may not be block diagonal, it is recognized as an affinity matrix instead and spectral clustering methods are applied to DZ Ã DzD(Z Ã ) T D to obtain a block diagonal matrix, where T denotes the matrix or vector transpose and DZ Ã D denotes a matrix whose entries are the absolute values of Z.

Identifying Gene Clusters Using LRR
Denoted the gene expression data matrix as D with size p|n, each row of D containing the expression levels of a gene in all the n samples, and each column of D containing the expression levels of all the p genes in one sample. Our goal of using LRR algorithm to model the gene expression data is to discover subspace gene clusters, so we can cluster the genes according to their representation matrices. When we apply the LRR algorithm to cluster genes, we need to use D T instead of D. The Eq.(3) can be written as: In order to cluster the genes into their respective subspaces, we need to compute an affinity matrix that encodes the pairwise affinities between data vectors. So we use the data D T itself as the dictionary, i.e., problem(4) becomes  where X~½x 1 ,x 2 , ::: ,x p (each column is a gene), Z~½z 1 ,z 2 , ::: ,z p is the coefficient matrix with each z i being the representation of x i .
is l 2,1 -norm, and the parameter lw0 is used to balance the effects of the two parts, which could be chosen according to properties of the two norms, or tuned empirically. Since l 2,1 -norm encourages the columns of E to be zero, the underlying assumption here is that some data vectors are corrupted and the others are clean.
The optimization problem (6) is convex and can be solved by various methods [31]. For efficiency, we adopt in this paper the Augmented Lagrange Multiplier(ALM) method [31][32][33]. We first convert Eq.(6) to the following equivalent problem: This problem can be solved by the ALM method, which minimizes the following augmented Lagrange function: The columns of the table summarize the total sizes of the module (numbers in parentheses), the number of genes annotated in the cluster, the GO categories associated with the cluster, and the P-value after FDR correction. doi:10.1371/journal.pone.0059377.t002 The above problem is unconstrained. So it can be minimized with respect to J, Z and E, respectively, by fixing the other variables, and then updating the Lagrange multipliers Y 1 and Y 2 , where mw0 is a penalty parameter. The inexact ALM method, also called the alternating direction method, is outlined in Algorithm 1. Its convergence properties could be proven similarly as those in [32].
After solving problem (6), as in [25], we utilize the lowest-rank representation (denoted by Z Ã ) to define the affinity matrix of an undirected graph. We treat gene clustering problem as a graph partitioning problem. The gene vectors correspond to the vertices and the affinity between x i and x j is computed by D½Z Ã ij DzD½Z Ã ji D.
In clustering, we seek to partition the set of vertices into disjoint sets where by some measure the similarity among the vertices in a set is high and across different sets is low. So we could use the spectral clustering algorithms such as Normalized Cuts [34] to produce the final clustering results, since Normalized Cuts is an unbiased measure which can minimize the disassociation between the clusters of a graph and maximize the association within the clusters simultaneously. The final clusters are gene clusters that we discover from the gene expression data. Integrating LRR with spectral clustering has the following advantages. First, since LRR may fail to obtain a block-diagonal representation in complex applications, the spectral clustering could ensure the robustness of the clustering. Second, it is convenient to integrate the lowest-rank representation with other information by defining such an undirected graph For example, in some specific applications such as subspace segmentation, ones may want to enforce that only the neighbor samples can be connected by edges. Therefore, the proposed clustering method based LRR can be capable of extracting useful information from a huge amount of background noise and it also can well capture the intrinsic patterns of genes with similar functions. Algorithm 2 summarizes the whole clustering algorithm of LRR for gene expression data.
Algorithm 1 Solving Problem (6) by Inexact ALM Input: data matrix X, parameter l Initialized: while not converged do 1) fix the others and update J by 2) fix the others and update Z by  (6) 2) Construct an undirected graph by using the lowest-rank representation to define the affinity matrix of the graph 3) Use Normalized Cuts to cluster the vertices of the graph into k clusters

Gene Expression Data
In this paper, three published gene expression datasets are used to analyze the performance of our methods: the yeast dataset [35], the yeast_Spellman dataset [36] and nomal human tissue dataset [37].
The yeast dataset contains 173 samples collected under several different conditions, which include temperature shocks, hyper-and hypoosmotic shocks, exposure to various agents such as peroxide,

GeneCodis Analysis
GeneCodis is a web-based tool for finding sets of biological annotations that frequently appear together and are significant in a set of genes. It allows the integrated analysis of annotations from different sources and generates statistical rank scores for signal annotations and their combinations. GeneCodis is an important extension of existing tools for the functional analysis of genes lists [38][39][40], and it is publicly available at http:// genecodis.cnb.csic.es.
The application of GeneCodis is simple. It takes a list of genes which are in a cluster as input and determines biological annotations or combinations of annotations that are overrepresented with respect to a reference list. Meanwhile, selecting one or more categories that you want include in the analysis is necessary. In addition, the organism selected is Saccharomyces cerevisiae for yeast dataset and yeast_Spellman data, while the organism selected is Homo sapiens for normal human tissue dataset. When the genes are submitted, the modular enrichment analysis and singular enrichment analysis can be obtained. For a detailed description of this method, see the online help for the program.
In the GeneCodis method, P-values obtained through Hypergeometric analysis corrected by false discovery rate (FDR) method [41,42]. Briefly, a gene list of the same size of the input list is generated by randomly selecting genes from the set of genes defined as the reference distribution. The process of extracting frequent sets of annotations is repeated and P-values for the annotations and combinations of annotations generated from this random list are calculated using the same statistical test. This process is repeated 100 times and the corrected P-values for each set of K-annotations are calculated as the fraction of permutations The columns of the table summarize the total sizes of the cluster (numbers in parentheses), the number of genes annotated in the cluster, the GO categories associated with the cluster, and the P-value after FDR correction. doi:10.1371/journal.pone.0059377.t005 having any annotation of the same value of K with a P-value as good or better than the observed P-value.

Results and Discussion
In this section, the proposed method is applied to identify the subspace gene clusters in the form of functional links among genes. The GeneCodis is executed to investigate the enrichment of functional annotations of genes in each cluster. First, LRR based methods and K-means, GPCA are carried on the synthetic data. Then, these methods are used to find gene clusters from real gene expression data. The MATLAB code is supplied in File S1 and File S2.

Simulation on Synthetic Data
We first tested our method on synthetic data to assess its performance for clustering. We synthesize a network that consists of 2000 genes each with 200 samples and 20 transcription factors. We construct 20 independent subspaces S i f g 20 i~1 5R 200 whose bases U i f g 20 i~1 are computed by U iz1~T U i , 1ƒiƒ19, where T is a random rotation and U 1 is a random orthogonal matrix of dimension 200|100. So, each subspace has a dimension of 100. We sample 100 data vectors from each subspace by X i~Ui Q i , 1ƒiƒ20 with Q i being a 100|100 noise matrix, Q i *N 0,1 ð Þ. Then the noise matrix is added to data matrix with different Signal-to-Noise Ratios (SNR). The simulation scheme in [27] is used to generate the synthetic data. The average receiver operator characteristic (ROC) curves are shown in Figure 2 with four different SNR (signal-to-noise ratio). The corresponding l in LRR based method are set as 0.01, 0.1, 1 and 10, respectively.
From Figure 2, we can see that for different SNRs, our LRR based method consistently outperforms the competitive methods. When the noise level increases (i.e., SNR decreases), three methods suffer from performance degradation by a corresponding decrease in the AUCs (Table 1). However, the AUC range of LRR is from 0.9908 to 0.8928, while the AUC ranges of K-means and GPCA are from 0.9233 and 0.9255 to 0.6643 and 0.6154, respectively. The columns of the table summarize the total sizes of the cluster(numbers in parentheses), the number of genes annotated in the cluster, the GO categories associated with the cluster, and the P-value after FDR correction. doi:10.1371/journal.pone.0059377.t006 The decrease rates of K-means and GPCA are higher than that of LRR. In other word, K-means and GPCA are sensitive to the noise, while LRR is robust to the noise. Moreover, the proposed method can extract useful information from a high level of background noise.
From the experiments on synthetic data, a conclusion can be drawn that LRR method outperforms other methods for clustering.

Experimental Results on the Yeast Dataset
For yeast dataset, we first used KNNimpute [43] to fill in missing values. When we apply our method to cluster the genes, the parameters l and k need to be considered. Since the original gene expression matrix obtained from a scanning process contains noise, missing values, and systematic variations arising from the experimental procedure. The gene expression data contains a huge amount of noise. In our method, the parameter l is used to balance the effects of noise. For the yeast dataset, we take l = 0.1, because when taking this value, the enrichment analysis based on GO can achieve the most significant result. With regards to the parameter k, more clusters are discovered with the increase of k. In this experiment, according to the former work [44][45][46], we choose k = 30.   Table 2 lists the most enriched GO categories of modular enrichment analysis in each cluster uncovered from the yeast dataset. In this table, the first two columns list total number of genes in each cluster and the number of annotated genes in each cluster respectively. It should be noted that among the 6152 genes, only 6123 genes were annotated by GO and KEGG database. The last column on the right are P-values obtained through Hypergeometric analysis corrected by false discovery rate (FDR) method. In Table 3, we analyzed the function of genes clustered in two clusters. For example, there are 14 out of 30 genes clustered sharing similar function 'response to stress', and Cluster C3 is significantly associated with 'response to stress'. Therefore, genes share with similar function can be clustered in the same cluster. Table 4 lists a few functional categories in each significantly enriched modules (corrected P-value ,10 220 ). Figures 3 and 4 show the enriched combinations of significant annotations of Biological Process (BP) of C17 and that of Molecular Function of C17 respectively, using pie graphs and bar graphs.
We also adopted K-means clustering and GPCA as compared methods applying on the yeast dataset, and the experimental results are listed in Tables 5 and 6. These two tables show K-means clustering and GPCA are efficient in clustering the genes. However, our method can identify more significantly enriched clusters, e.g. 11 clusters were discovered with corrected P-value ,10 220 , while only 4 and 1 clusters by K-means and GPCA, respectively. The number of annotated genes in each cluster using our algorithm is more than that of other two methods.
Moreover, the proposed approach is effective in clustering together genes with similar expression profiles and similar function categories. Cluster C17, for example, is strongly associated with the 'structural constituent of ribosome' process (corrected P-value ,3.58803610 2202 ). The heatmap indicated similar expression patterns of genes under different experimental conditions ( Figure 5(A)). More significantly, our algorithm can cluster genes which show different expression profiles but similar functions. Cluster C14, for example, is significantly enriched by 'cellular  amino acid biosynthetic process' (corrected P-value ,2.33136610 241 ). The heatmap revealed as least two distinctive expression patterns in this cluster (denoted as a and b, Figure 5(B)). Which show the advantage of our subspace clustering method.
We also compared statistical significance of common enriched functional categories in gene clusters uncovered by our algorithm and K-means (Table 7). Among those common functional categories detected significantly by these methods, there are five out of eight functional categories that our method produced significantly lower corrected P-value than K-means method did. In addition, our method is robust to the noise, while other two methods are not. We also found the performance of GPCA is not better than that of K-means clustering for this dataset. To investigate the performance of these methods explicitly, Table 8 lists the average values of negative logarithm of corrected P-value on three datasets using K-means, GPCA and LRR, respectively. From Table 8 (a), we also can see that LRR based method outperforms other methods on yeast dataset.

Experimental Results on Yeast_Spellman Dataset
We also first used KNNimpute to fill in missing values. In this experiment, we chose l = 0.01 due to the dataset contains a huge amount of noise. Table S1 (In File S1 and File S2) lists the most enriched GO categories of modular enrichment analysis in each gene cluster uncovered from the yeast_Spellman dataset. In this dataset, there are 6074 genes of 6178 genes annotated by GO and KEGG. Table S2 also lists a few functional categories in each significantly enriched modules (corrected P-value ,10 210 ).  Tables S3 and S4, our approach can discover 11 significantly enriched clusters with corrected P-value ,10 210 , while 8 and 3 clusters by K-means clustering and GPCA, respectively. We also compared statistical significance of common enriched functional categories in gene clusters uncovered by our algorithm and K-means (Table S5). Among those common functional categories detected significantly by these methods, there are six out of eight functional categories that our method produced significantly lower corrected P-value than K-means method did. In the experiment we also found that the result of GPCA is not robust to noise. In addition, for yeast_Spellman dataset, the average value of negative logarithm of corrected P-value using LRR based method listed in Table 8 (b) is higher than the values listed in Table 8 (b) using K-means and GPCA, respectively. Therefore, LRR based method can achieve better result than other methods.

Experimental Results on Normal Human Tissue Dataset
In this experiment, we chose l = 10 since the dataset contains some noise but no missing values. Table S6 lists the most enriched GO categories of modular enrichment analysis in each gene clusters discovered from the human tissue dataset, and only 5991 genes were annotated by GO and KEGG among the 7070 genes. Among 30 gene clusters, ten clusters were significantly enriched by GO and KEGG (corrected P-value ,10 220 ). Each cluster appeared to be dominated by only a few functional categories (Table S7). Similar to above two experiments, Figure 7(A) shows the heatmap of Cluster C18, which is significantly enriched by 'muscle filament sliding' (corrected P-value ,1.29274610 239 ). It can be seen that the expression patterns of genes in this cluster are similar and similar function categories. On the contrary, the heatmap of Cluster C3, which is significantly enriched by 'extracellular region' (corrected P-value ,4.79992610 225 ), reveals different expression profiles but similar functions (Figure 7(B)).
The experimental results of the K-means clustering and GPCA methods are listed in Tables S8 and S9. Our method can discover 10 significantly enriched clusters with corrected P-value ,10 220 , while K-means and GPCA can identify 1and 3 clusters, respectively. Moreover, from Table 8 (c), we also can find that the performance of LRR based method is better than that of other methods. Similar to above two experiments, we compared statistical significance of common enriched functional categories in gene clusters uncovered by our algorithm and K-means (Table  S10). Among those common functional categories detected significantly by these methods, there are five functional categories that our method produced significantly all lower corrected P-value than K-means method did. For this dataset, the performance of GPCA is better than K-means clustering.

Conclusions
In this study, we present low-rank representation based method for identifying subspace gene clusters from microarray data. The new approach can cluster the genes via low-rank criterion. Our goal is to find a block diagonal representation matrix from gene expression data using low-rank representation. In this block diagonal matrix each block corresponds to a cluster. Therefore, the genes in each cluster have similar functions. Compared with other clustering methods, the proposed method offers several advantages. Firstly, it can identify genes of similar functions yet without similar expression profiles. Secondly, the method can assign one gene into different modules. Thirdly, our method is capable of extracting useful information from a high level of background noise. In a word, our method leads to a significant improvement in identifying biologically relevant gene clusters. In the experiment we also found that many categories discovered by different methods are different. So in practice, different methods can be used to find more reliable result.
Otherwise, subspace gene clusters identified using the proposed method may represent co-regulated genes to some degree. However, due to the limited information present in any dataset, genes in the same cluster might be co-expressed but not necessarily co-regulated [47][48][49]. Therefore, to design an effective algorithm for finding co-regulated genes is our future work.