Robust auto-weighted multi-view subspace clustering with common subspace representation matrix

In many computer vision and machine learning applications, data sets are distributed on certain low-dimensional subspaces. Subspace clustering is a powerful technique for finding the underlying subspaces and clustering data points correctly. However, traditional subspace clustering methods can only be applied to data from one source, and how to extend these methods so that they can combine information from various data sources has become an active area of research. Previous multi-view subspace methods aim to learn multiple subspace representation matrices simultaneously, and the learning tasks for different views are treated equally. After obtaining the representation matrices, they stack up the learned matrices to form the common underlying subspace structure. However, for many problems, the importance of sources and the importance of features within one source can both vary, which makes the previous approaches ineffective. In this paper, we propose a novel method called Robust Auto-weighted Multi-view Subspace Clustering (RAMSC). In our method, the weights for both sources and features can be learned automatically by exploiting a novel reformulation trick and introducing a sparse norm. More importantly, the objective of our method is a common representation matrix which directly reflects the common underlying subspace structure. A new efficient algorithm is derived to solve the formulated objective, with a rigorous theoretical proof of its convergence. Extensive experimental results on five benchmark multi-view datasets demonstrate that the proposed method consistently outperforms state-of-the-art methods.


Introduction
In many applications such as computer vision, data mining, pattern recognition and machine learning, there exists an assumption that the data points are drawn from multiple low-dimensional subspaces, with each subspace corresponding to one category or class. Subspace clustering [1,2] aims to explore the underlying subspaces and cluster the data accordingly. Early subspace clustering methods can be roughly grouped into two categories: algebra based methods such as [3,4], and statistics based methods such as [5,6]. Recently, many methods [7][8][9][10][11][12][13][14][15][16] belonging to a new category, i.e., spectral clustering based [1] methods, have been proposed, and these methods have achieved state-of-the-art performance. The core idea of spectral clustering based methods is to apply the self-representation property to compute affinities, i.e., to represent every data point by a linear combination of other data points. However, these methods mostly focus on features from a single source rather than multiple ones.
In actual applications, data is often collected from diverse domains or obtained from different feature extractors, so multi-view data are very common. For example, in computer vision, each image can be described by its color, texture, shapes and so on. In web mining, each web page can be characterized by its content and link information, which are two distinct descriptions or views. In multi-lingual information retrieval, a document can be represented in several different languages. Since these different features provide useful information from different views and single-view subspace clustering methods have shown good performance, it is crucial to integrate these heterogeneous features to create more accurate and robust multi-view subspace clustering methods.
More recently, a number of multi-view subspace clustering methods have been proposed [17][18][19]. The diversity-induced multi-view subspace clustering (DiMSC) was proposed in [17] to perform subspace clustering on different views simultaneously with a diversity term on the multiple representation matrices. The multi-view subspace clustering (MVSC) was introduced in [18] to perform clustering on the subspace representation of each view simultaneously with a common cluster structure. The low-rank tensor constrained multi-view subspace clustering (LT-MSC), proposed in [19], performs subspace clustering on different views simultaneously with a low-rank tensor constraint, where the tensor is constructed from the subspace representation matrices. After obtaining the subspace representation matrices, these methods use them to construct similarity matrices for different views independently and stack these similarity matrices up into a common one which represents the underlying common structure across different views. However, these methods neglect the different importance among views, and the performance of their unified similarities may suffer when there is a less informative view.
In this paper, we try to solve the problem of subspace clustering for multi-view data. A novel method, named Robust Auto-weighted Multi-view Subspace Clustering (RAMSC), is presented. Different from the previous approaches [17][18][19], which treat different views equally and obtain a representation matrix in each view, our proposed method assigns a suitable weight to each view and aims to learn a common representation matrix across different views to reflect the underlying common structure. Besides, the view weight factors can be tuned automatically, and this process does not need any additional parameters. By introducing a sparse norm, our proposed method is robust to inaccurate features. We provide an effective algorithm to solve the proposed non-smooth minimization problem and prove that the algorithm converges. In the algorithm, a feature weight matrix can be learned for each view, and we also propose a new way to construct the common similarity matrix by utilizing the view weight factors and feature weight matrices. Compared to related state-of-the-art clustering methods, our proposed method consistently achieves better performance on five benchmark multi-view data sets.
The rest of this paper is organized as follows. Section 2 introduces the background and motivation of this paper. In Section 3, we propose our method RAMSC with a solving algorithm. In Section 4, we present some deep analyses about the proposed algorithm RAMSC, including convergence behavior, computational complexity and parameter determination. Experimental results and conclusions are shown in Section 5 and Section 6, respectively.

Background and motivation
In this section, first we introduce some notations, then briefly review the previous subspace clustering methods to show our research motivation.

Notations
Throughout this paper, matrices and vectors are written as boldface uppercase letters and boldface lowercase letters, respectively. For a vector m, the ℓ2-norm of m is denoted by $\|m\|_2$, and $m^{(v)}$ denotes that m is derived from the v-th view. For a matrix M, we denote its i-th row, j-th column and ij-th element as $m_{i:}$, $m_{:j}$ and $m_{ij}$, respectively. The trace of M is denoted by Tr(M), and $M^{(v)}$ denotes a matrix derived from the v-th view representation. The ℓ_{r,p}-norm of a matrix $M \in \mathbb{R}^{d \times n}$ is defined as [20,21]

$\|M\|_{r,p} = \Big( \sum_{i=1}^{d} \Big( \sum_{j=1}^{n} |m_{ij}|^{r} \Big)^{p/r} \Big)^{1/p}$.    (1)

When r ≥ 1 and p ≥ 1, the ℓ_{r,p}-norm is a valid norm because it satisfies the three norm conditions.
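As a concrete illustration of this definition, the ℓ_{r,p}-norm applies the ℓ_r-norm to each row and then the ℓ_p-norm to the vector of row norms. The following is a minimal NumPy sketch (the function name `lrp_norm` is our own, not from the paper):

```python
import numpy as np

def lrp_norm(M, r=2.0, p=1.0):
    """ell_{r,p}-norm of M: the ell_r-norm of each row, followed by
    the ell_p-norm of the resulting vector of row norms."""
    row_norms = np.power(np.abs(M), r).sum(axis=1) ** (1.0 / r)
    return (row_norms ** p).sum() ** (1.0 / p)

M = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
# Row ell_2-norms are [5, 0, 1]; with p = 1 the ell_{2,1}-norm is 6.
print(lrp_norm(M, r=2, p=1))  # -> 6.0
```

With r = p = 2 this reduces to the Frobenius norm, which is a quick sanity check on the implementation.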

Single-view and multi-view subspace clustering
Suppose $X = [x_{:1}, x_{:2}, \ldots, x_{:n}] \in \mathbb{R}^{d \times n}$ is the data matrix with d-dimensional features and n data points. Subspace clustering methods based on spectral clustering mainly have the following two steps. First, the self-representation property [7] is used to represent the data matrix X as

$X = XZ + E$,    (2)

where $Z = [z_{:1}, z_{:2}, \ldots, z_{:n}] \in \mathbb{R}^{n \times n}$ is the self-representation matrix with each $z_{:i}$ being the representation of sample $x_{:i}$, and E is the error matrix. The nonzero elements of $z_{:i}$ correspond to points from the same subspace. Z can be obtained by solving

$\min_{Z \in \mathcal{C}} \|X - XZ\|_{\ell} + \lambda \Omega(Z)$,    (3)

where $\|\cdot\|_{\ell}$ can be any proper norm on the error matrix E, $\Omega(Z)$ and $\mathcal{C}$ are the regularizer and constraint set on Z, respectively, and λ > 0 is a balance parameter. The existing methods [7][8][9][10][11][12][13] distinguish themselves from each other by employing different constraints or regularizers on Z or E. Second, the obtained subspace structure Z is used to construct a similarity matrix S which encodes the pairwise similarity between data points [22]:

$S = (|Z| + |Z^{T}|)/2$.    (4)

Afterwards, a spectral clustering algorithm [23] can be applied to the computed similarity matrix S to get the final clustering results.
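To make the two-step pipeline concrete, the sketch below instantiates Eq (3) with one simple choice — a Frobenius-norm error and a ridge regularizer, which gives a closed-form Z — and then runs a basic normalized spectral clustering on S. The regularizer choice and function names are our own illustrative assumptions, not a specific method from the paper:

```python
import numpy as np
from scipy.linalg import solve
from scipy.cluster.vq import kmeans2

def subspace_cluster(X, n_clusters, lam=0.1, seed=0):
    """Two-step spectral subspace clustering on a single view."""
    n = X.shape[1]
    # Step 1: self-representation. With a Frobenius-norm error and
    # Omega(Z) = ||Z||_F^2, Eq (3) has the closed form
    # Z = (X^T X + lam*I)^{-1} X^T X.
    G = X.T @ X
    Z = solve(G + lam * np.eye(n), G)
    # Step 2: similarity S = (|Z| + |Z^T|)/2 (Eq (4)), then a basic
    # normalized spectral clustering on S.
    S = (np.abs(Z) + np.abs(Z.T)) / 2.0
    d = S.sum(axis=1)
    L_sym = np.eye(n) - S / np.sqrt(np.outer(d, d))  # normalized Laplacian
    _, vecs = np.linalg.eigh(L_sym)                  # ascending eigenvalues
    U = vecs[:, :n_clusters]                         # spectral embedding
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)
    _, labels = kmeans2(U, n_clusters, minit='++', seed=seed)
    return labels, Z
```

On data drawn from two orthogonal lines, for example, the learned Z is block-diagonal and the two groups separate cleanly.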
For multi-view data, suppose that V is the number of views and X^(1), X^(2), . . ., X^(V) denote the data matrices of each view, where $X^{(v)} \in \mathbb{R}^{d^{(v)} \times n}$ for v = 1, 2, 3, . . ., V and d^(v) is the dimensionality of the v-th view. Single-view subspace clustering methods cannot be applied to multi-view data to obtain a representation matrix directly. One naive strategy is to concatenate all the features together as a new view, and then employ single-view methods on the concatenated features. However, this strategy ignores the differences among the views. The previous multi-view subspace clustering methods consider that a subspace representation $Z^{(v)} \in \mathbb{R}^{n \times n}$ should be learned for each single view. They stack up these V tasks and focus on exploring the relationships among the V representation matrices Z^(v) so that all Z^(v) can be learned simultaneously. Two more reasonable strategies from multi-view learning are adopted to achieve this goal. The first is to explore complementary information from multiple views: DiMSC [17] explores the complementarity of the representations Z^(v) by applying the Hilbert-Schmidt independence criterion as a diversity term, and LT-MSC [19] explores complementary information by regarding the subspace representation matrices Z^(v) as a tensor and equipping the tensor with a low-rank constraint. The second strategy is to explore the consistency among multiple views: MVSC [18] explores the consistency of the representations Z^(v) by performing subspace clustering on each individual modality and then unifying them with a common indicator matrix.
After obtaining a representation Z^(v) for each view, all the above-mentioned multi-view subspace clustering methods construct a similarity matrix S by

$S = \sum_{v=1}^{V} (|Z^{(v)}| + |Z^{(v)T}|)/2$.    (5)

Then they apply a spectral clustering algorithm [23] on S to obtain clustering results. Although these multi-view subspace clustering methods have achieved good performance, they have two main drawbacks which leave room to improve the clustering performance: 1. These methods treat different views equally and neglect the different importance of different views. When they learn Z^(v), each view plays an equally important role. When they construct the similarity matrix S, Eq (5) can be considered as $S = \sum_{v=1}^{V} S^{(v)}$, where $S^{(v)} = (|Z^{(v)T}| + |Z^{(v)}|)/2$ is a graph similarity matrix constructed from the v-th view. This strategy may suffer when an unreliable similarity matrix is added in.
2. Eq (5) can also be considered as $S = (|Z^{T}| + |Z|)/2$, where $|Z| = \sum_{v=1}^{V} |Z^{(v)}|$ can be considered as the underlying common structure across different views. The optimization objectives of these methods are the representation matrices Z^(v); however, the final clustering results are determined by the common structure Z. This brings a drawback: the Z^(v) may have good properties because of the constraints or regularized terms, but Z may not keep these properties.
To address these two challenges, we introduce our proposed novel multi-view subspace clustering method in the next section.

Formulation and solution
In this section, we will first introduce the formulation of our method, and then an alternating algorithm will be presented to solve it.

Formulation
To overcome these two drawbacks, we propose a novel robust auto-weighted multi-view subspace clustering method. Our proposed method RAMSC utilizes a reasonable way to set the view weight factors automatically and learns a common subspace representation Z which can be directly used to construct the common similarity matrix S across different views. Thus an important view can have a large weight, and the constraints or regularized terms can be set on Z, which determines the final clustering results. The objective function of RAMSC is

$\min_{Z} \sum_{v=1}^{V} \sqrt{ \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p} + \lambda\, \Omega^{(v)}(Z) }$,    (6)

where λ is a tradeoff factor, $\|\cdot\|_{2,p}^{p}$ is the sparsity-inducing norm with 0 < p ≤ 1, and each Ω^(v)(Z) is a smooth regularized term. Denote the representation error matrix $E^{(v)} \in \mathbb{R}^{d^{(v)} \times n}$ of the v-th view as

$E^{(v)} = X^{(v)} - X^{(v)}Z$.    (7)

The $\|\cdot\|_{2,p}^{p}$-norm of a matrix $E^{(v)} \in \mathbb{R}^{d^{(v)} \times n}$ is defined as

$\|E^{(v)}\|_{2,p}^{p} = \sum_{i=1}^{d^{(v)}} \|e^{(v)}_{i:}\|_{2}^{p}$,    (8)

where $e^{(v)}_{i:}$ is the i-th row of E^(v) and $\|E^{(v)}\|_{2,p}$ is the ℓ_{2,p}-norm as defined in Eq (1) with r = 2. Ω^(v)(Z) aims to smooth the distribution of the common representation Z on the v-th view.
These V smooth regularized terms Ω^(v)(Z) enforce the common subspace representation matrix Z to satisfy the grouping effect. An analogous smooth regularized term is also employed by [13,17,24]. Specifically, in our method, each regularized term Ω^(v)(Z) for v = 1, 2, . . ., V is defined as

$\Omega^{(v)}(Z) = \frac{1}{2} \sum_{i,j=1}^{n} w^{(v)}_{ij}\, \|z_{:i} - z_{:j}\|_{2}^{2} = \mathrm{Tr}(Z L^{(v)} Z^{T})$,    (9)

where $W^{(v)} = (w^{(v)}_{ij})$ is the weight matrix measuring the spatial closeness of the data points on the v-th view, and $L^{(v)} = D^{(v)} - W^{(v)}$ is the corresponding graph Laplacian with degree matrix $D^{(v)}$. The weight matrices W^(v)
can be constructed in many different ways [25][26][27][28][29]. To show the robustness of our method, we construct 0-1 binary weighted k-nn graphs for each view, and k is set to 5 in all experiments.
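A 0-1 binary k-nn graph and its Laplacian can be built directly from pairwise distances; the following is a small NumPy sketch (the symmetrization rule — an edge if either point is among the other's k nearest neighbours — is one common convention, assumed here):

```python
import numpy as np

def binary_knn_graph(X, k=5):
    """0-1 binary k-nn graph on the columns of X (one view), with
    graph Laplacian L = D - W."""
    n = X.shape[1]
    # Pairwise Euclidean distances between columns.
    D = np.linalg.norm(X[:, :, None] - X[:, None, :], axis=0)
    W = np.zeros((n, n))
    for j in range(n):
        nn = np.argsort(D[:, j])[1:k + 1]  # skip the point itself
        W[nn, j] = 1.0
    W = np.maximum(W, W.T)                 # symmetrize
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
    return W, L
```

The resulting L is symmetric positive semi-definite with zero row sums, which is what the smoothing term Tr(Z L^(v) Z^T) requires.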
Intuitively, there is no weight factor explicitly defined in Eq (6), and all different views are treated equally. However, the following analysis shows that Eq (6) provides a reasonable way to learn the weight factor of each view. The Lagrange function of problem (6) can be written as

$\mathcal{L}(Z) = \sum_{v=1}^{V} \sqrt{ \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p} + \lambda\, \mathrm{Tr}(Z L^{(v)} Z^{T}) }$.    (10)

Taking the derivative of Eq (10) with respect to Z and setting the derivative to zero, we have

$\sum_{v=1}^{V} \alpha^{(v)}\, \frac{\partial \big( \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p} + \lambda\, \mathrm{Tr}(Z L^{(v)} Z^{T}) \big)}{\partial Z} = 0$,    (11)

where

$\alpha^{(v)} = \frac{1}{2 \sqrt{ \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p} + \lambda\, \mathrm{Tr}(Z L^{(v)} Z^{T}) }}$.    (12)

Eq (11) cannot be solved directly because α^(v) depends on the target variable Z. However, if α^(v) is considered as the weight factor of the v-th view, and its value is given or kept stationary, Eq (11) can be considered as the solution of the following problem:

$\min_{Z} \sum_{v=1}^{V} \alpha^{(v)} \big( \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p} + \lambda\, \mathrm{Tr}(Z L^{(v)} Z^{T}) \big)$.    (13)

Solving problem (13) to obtain the common representation matrix Z seems more reasonable. This problem can be considered as a sum of two parts with a tradeoff factor λ. The first part $\sum_{v=1}^{V} \alpha^{(v)} \|X^{(v)} - X^{(v)}Z\|_{2,p}^{p}$ is a linear combination of the subspace representation errors on each view; increasing α^(v) tends to reduce the representation error on the v-th view. The second part smooths Z on a linear combination of Laplacian matrices with suitable weights α^(v), i.e., $L = \sum_{v=1}^{V} \alpha^{(v)} L^{(v)}$. According to [30][31][32], the accuracy of L can be higher than that of each L^(v) or their plain sum $\sum_{v=1}^{V} L^{(v)}$. Supposing the common representation Z has been calculated from Eq (13), this Z can be used to update α^(v) according to Eq (12). Learning α^(v) in this way has the following reasonable explanations and merits: 1. If the v-th view is good, then $\|E^{(v)}\|_{2,p}^{p}$ and $\mathrm{Tr}(Z L^{(v)} Z^{T})$ should be small, and thus according to Eq (12), the learned α^(v) is large.
2. The ℓ_{2,p}-norm of E^(v) enforces the ℓ_p-norm along the feature direction of the representation error matrix E^(v), and the ℓ_2-norm along the data-point direction. Thus, when 0 < p ≤ 1, the effect of inaccurate features on the learning of α^(v) is reduced by the ℓ_p-norm.
3. Unlike [31][32][33][34][35], which depend on an extra parameter to smooth the distribution of the view weights, learning α^(v) by Eq (12) involves no parameter to tune, and it naturally avoids the trivial solution.
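The view-weight update of Eq (12) is straightforward to implement given the current Z; below is a small NumPy sketch (the list-based API and the tiny stabilizer added to the denominator are our own choices):

```python
import numpy as np

def update_view_weights(Xs, Z, Ls, lam, p=1.0):
    """alpha^(v) = 1 / (2 sqrt(||X - XZ||_{2,p}^p + lam*Tr(Z L Z^T)))
    for every view (Eq (12)). Xs and Ls are lists of the per-view
    data matrices and graph Laplacians."""
    alphas = []
    for X, L in zip(Xs, Ls):
        E = X - X @ Z
        err = (np.linalg.norm(E, axis=1) ** p).sum()  # ||E||_{2,p}^p over rows
        smooth = np.trace(Z @ L @ Z.T)
        alphas.append(1.0 / (2.0 * np.sqrt(err + lam * smooth) + 1e-12))
    return np.array(alphas)
```

As the analysis above predicts, a view with a larger representation error receives a smaller weight.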
Although problem (13) has a more reasonable form for learning a good common Z, it is difficult to solve for two reasons: (1) the $\|\cdot\|_{2,p}^{p}$ terms are non-smooth; (2) when α^(v) is calculated by Eq (12), α^(v) and Z are coupled with each other. In the next subsection, we propose an alternating algorithm to tackle these difficulties efficiently.

Optimization algorithm
To solve Eq (13), we tackle the non-smooth norm by considering the following surrogate problem:

$\min_{Z}\; J(Z) = \sum_{v=1}^{V} \alpha^{(v)} \big( \mathrm{Tr}\big( (X^{(v)} - X^{(v)}Z)^{T} U^{(v)} (X^{(v)} - X^{(v)}Z) \big) + \lambda\, \mathrm{Tr}(Z L^{(v)} Z^{T}) \big)$,    (14)

where $U^{(v)}$ is the diagonal matrix

$U^{(v)} = \mathrm{diag}\big( u^{(v)}_{11}, \ldots, u^{(v)}_{d^{(v)} d^{(v)}} \big)$,    (15)

whose i-th diagonal element is

$u^{(v)}_{ii} = \frac{p}{2\, \|e^{(v)}_{i:}\|_{2}^{2-p}}$.    (16)

Since $u^{(v)}_{ii}$ cannot be calculated when $\|e^{(v)}_{i:}\|_{2} = 0$ and p < 2, in practice we replace the ℓ_{2,p}-norm with the regularized ℓ_{2,p}-norm, defined as

$\|E^{(v)}\|_{2,p}^{p} \approx \sum_{i=1}^{d^{(v)}} \big( \|e^{(v)}_{i:}\|_{2}^{2} + \epsilon \big)^{p/2}$;    (17)

when $\epsilon \to 0$, the regularized $\|\cdot\|_{2,p}^{p}$ of E^(v) approximates $\|E^{(v)}\|_{2,p}^{p}$. Thus $u^{(v)}_{ii}$ can now be regularized as

$u^{(v)}_{ii} = \frac{p}{2\, \big( \|e^{(v)}_{i:}\|_{2}^{2} + \epsilon \big)^{(2-p)/2}}$.    (18)

This strategy avoids a zero denominator and guarantees that we can repeat the following alternating steps.
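The regularized weight update of Eq (18) just described can be sketched in a few lines (the function name and the default ε are our own choices):

```python
import numpy as np

def update_feature_weights(X, Z, p=1.0, eps=1e-8):
    """Diagonal feature-weight matrix U^(v) of Eq (18):
    u_ii = p / (2 (||e_i:||_2^2 + eps)^((2-p)/2)), which avoids a
    zero denominator when a row of E = X - XZ is all zeros."""
    E = X - X @ Z
    row_sq = (E ** 2).sum(axis=1)
    u = p / (2.0 * (row_sq + eps) ** ((2.0 - p) / 2.0))
    return np.diag(u)
```

Note that for p = 2 every u_ii equals 1, recovering the plain squared Frobenius error, while for p < 2 rows with larger residuals are down-weighted.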
• The first step is fixing U (v) and α (v) , updating the common subspace representation Z.
Differentiating the objective function J with respect to Z and setting it to zero gives

$\Big( \sum_{v=1}^{V} \alpha^{(v)} X^{(v)T} U^{(v)} X^{(v)} \Big) Z + Z \Big( \lambda \sum_{v=1}^{V} \alpha^{(v)} L^{(v)} \Big) = \sum_{v=1}^{V} \alpha^{(v)} X^{(v)T} U^{(v)} X^{(v)}$.    (19)

Eq (19) is a standard Sylvester equation, and according to [36], it has a unique optimal solution.
• The second step is fixing α (v) and Z, updating the feature weight matrix U (v) for each view.
The representation error matrix E (v) of each view is calculated by current Z, and then each diagonal element of U (v) is updated by Eqs (16) or (18).
• The third step is fixing Z and U (v) , updating the view weight factors α (v) for each view by Eq (12).
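The first of the three steps above — solving the Sylvester equation (19) — maps directly onto `scipy.linalg.solve_sylvester`, which solves AZ + ZB = C; a minimal sketch (list-based API is our own):

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_Z(Xs, Us, Ls, alphas, lam):
    """Step 1: solve A Z + Z B = C from Eq (19), where
    A = sum_v alpha_v X_v^T U_v X_v, B = lam * sum_v alpha_v L_v,
    and C = A."""
    A = sum(a * X.T @ U @ X for a, X, U in zip(alphas, Xs, Us))
    B = lam * sum(a * L for a, L in zip(alphas, Ls))
    return solve_sylvester(A, B, A)
```

The solution is unique as long as A and -B share no eigenvalues, which holds in particular when A is positive definite and B is positive semi-definite.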
By the above three steps, we alternately update Z, U^(v) and α^(v), and repeat the process iteratively. We can now draw the following conclusions: 1. In the above procedure, the alternating optimization converges, and Z*, which denotes the converged value of Z, is at least a local optimal solution to Eq (6). (We will prove this conclusion in the next section.)
2. The second conclusion concerns initialization. Since these procedures reach a local optimum of Eq (6), it is important to have a sensible initialization. We initialize all views with equal weights $\alpha^{(v)} = 1/V$ as in previous approaches [37,38]. And as in previous research [21,39], we initialize U^(v) = I^(v), since every feature on each view has the same importance at the beginning.
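Putting the three updates and the initializations just described together, the overall alternation can be sketched as follows. This is an illustrative assembly under our own API assumptions, not the authors' reference implementation:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def ramsc(Xs, Ls, lam=1.0, p=1.0, n_iter=20, eps=1e-8):
    """Alternate the three updates of Algorithm 1, then build
    S1 = (|Z*| + |Z*^T|)/2 (Eq (21))."""
    V, n = len(Xs), Xs[0].shape[1]
    alphas = np.full(V, 1.0 / V)             # equal initial view weights
    Us = [np.eye(X.shape[0]) for X in Xs]    # U^(v) = I initially
    for _ in range(n_iter):
        # Step 1: common representation Z via the Sylvester equation (19).
        A = sum(a * X.T @ U @ X for a, X, U in zip(alphas, Xs, Us))
        B = lam * sum(a * L for a, L in zip(alphas, Ls))
        Z = solve_sylvester(A, B, A)
        # Step 2: diagonal feature weights U^(v) from Eq (18).
        Es = [X - X @ Z for X in Xs]
        Us = [np.diag(p / (2.0 * ((E ** 2).sum(axis=1) + eps) ** ((2 - p) / 2)))
              for E in Es]
        # Step 3: view weights alpha^(v) from Eq (12).
        alphas = np.array([
            1.0 / (2.0 * np.sqrt((np.linalg.norm(E, axis=1) ** p).sum()
                                 + lam * np.trace(Z @ L @ Z.T)) + eps)
            for E, L in zip(Es, Ls)])
    S1 = (np.abs(Z) + np.abs(Z.T)) / 2.0
    return Z, S1, alphas
```

Spectral clustering on the returned S1 then yields the final labels, exactly as in the single-view pipeline.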
After obtaining the common self-representation matrix Z*, the similarity matrix S_1 can be defined as

$S_1 = (|Z^{*}| + |Z^{*T}|)/2$,    (21)

and the spectral clustering algorithm can then be used on it to produce the final clustering results, as adopted by traditional single-view subspace clustering methods. Some single-view subspace clustering methods also use other ways to construct the similarity matrix [13]. In this paper, to better exploit the merit of the grouping effect, we further utilize the learned view weight factors α^(v) and feature weight matrices U^(v) to define a new similarity matrix S_2 in Eq (22), which is computed from the per-view features $x^{(v)}_{:i}$ weighted by $\sqrt{\alpha^{(v)}}$ and U^(v), and is controlled by an additional parameter γ. Based on the above analysis, we summarize the procedures of our method RAMSC in Algorithm 1.

Performance analysis

Convergence analysis
To prove that the proposed Algorithm 1 converges and that it reaches at least a local optimal solution of Eq (6), we first need the following lemma [21].
Lemma 1 When 0 < p ≤ 2, for any positive numbers a and b, the following inequality holds:

$a^{p/2} - \frac{p\, a}{2\, b^{(2-p)/2}} \le b^{p/2} - \frac{p\, b}{2\, b^{(2-p)/2}}$.

Thus the alternating optimization monotonically decreases the objective of problem (6) in each iteration until it converges. At convergence, the converged Z* satisfies Eq (11), which is the KKT condition of problem (6). Therefore, Z* is at least a local optimal solution of problem (6).
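The inequality of Lemma 1 (equivalently, concavity of $a \mapsto a^{p/2}$ for p ≤ 2, so the tangent at b lies above the function) can be checked numerically; the quick sketch below samples random positive pairs:

```python
import numpy as np

def lemma1_holds(a, b, p):
    """Check a^{p/2} - p*a/(2 b^{(2-p)/2}) <= b^{p/2} - p*b/(2 b^{(2-p)/2})."""
    lhs = a ** (p / 2) - p * a / (2 * b ** ((2 - p) / 2))
    rhs = b ** (p / 2) - p * b / (2 * b ** ((2 - p) / 2))
    return lhs <= rhs + 1e-12  # small tolerance for rounding

rng = np.random.default_rng(0)
checks = [lemma1_holds(a, b, p)
          for a, b in rng.uniform(0.01, 10.0, size=(1000, 2))
          for p in (0.5, 1.0, 1.5, 2.0)]
print(all(checks))  # -> True
```

Equality holds at a = b, which is exactly the point used in the convergence argument: replacing the current residual by the new one can only decrease the majorized objective.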

Computational complexity and parameter determination
As seen from the procedure of RAMSC in Algorithm 1, we solve this problem in an alternating way. The computational complexity of each subproblem is as follows. (1) The problem in Eq (19) can be solved by the Bartels-Stewart algorithm, which has a computational complexity of O(n³); (2) The problem in Eq (16) can be effectively solved by computing the 2-norm of a vector; the computational complexity is $O(\sum_{v=1}^{V} (d^{(v)})^{2})$; (3) Solving the problem in Eq (12) to update the optimal weight for each view has complexity O(n² × V). In summary, the total computational complexity of RAMSC is $O(T \times \max\{ n^{3}, \sum_{v=1}^{V} (d^{(v)})^{2}, n^{2} V \})$, where T is the number of iterations. Since parameter determination is still an open problem [40,41], we determine the parameters of our method empirically, as in previous research. As for p, it is designed to add sparsity to the representation error matrices E^(v), which can alleviate the effect of inaccurate features on the learning of α^(v). Paper [43] is a timely and comprehensive survey and very good material for mastering the sparse learning field. Following it, we set p = 1, and this setting has been proven effective in most applications [20,42]. As for k, it is the neighbor number used to construct the graphs W^(v). Methods [13,17] using similar regularized terms perform stably with different k, so we construct 5-nn graphs.
As for the parameter λ, it is vital to the final performance since it balances the self-representation accuracy and the smoothness of Z. Since there is no prior information about λ, we determine it by grid search in a heuristic way, as in previous research [13,17,42]. Concretely, λ is tuned over 1, 2, and 5 to 60 with an incremental step of 5 to get the best λ. When Eq (22) is used to construct the similarity matrix, we search γ from 0.1 to 2 with an incremental step of 0.2 to get the best γ.

Experiments
In this section, our proposed RAMSC is evaluated on five widely used data sets, and numerical results on its convergence behavior are also shown.

Data set descriptions
To validate the effectiveness of our method, we use five multi-view benchmark datasets. They are various kinds of data arising in many real applications with different characteristics and are commonly used in multi-view learning. They are Microsoft Research Cambridge Volume 1 (MSRC-v1) [44], Caltech101 [45], NBA-NASCAR [46], Handwritten Dutch Digit Recognition (Digit) [47] and Web Knowledge Base (WebKB) [48]. The statistics of the five data sets are summarized in Table 1.

Experimental setup
To evaluate the performance of our method, we compare it with each single-view counterpart. Single-view methods on the concatenated features are also compared. Besides, we compare with other state-of-the-art methods, including robust multi-view K-means clustering (RMKMC) [33], pairwise co-regularized multi-modal spectral clustering (PC-SPC) [30], centroid co-regularized multi-modal spectral clustering (CC-SPC) [30], multi-view subspace clustering (MVSC) [18] and diversity induced multi-view subspace clustering (DiMSC) [17].
• SPC: We employ the standard spectral clustering (SPC) [23] algorithm directly on each view, and report the results as baselines.
• SMR: We first run smooth representation clustering (SMR) [13] on each view's features to get the subspace representations, and then run spectral clustering on those representations.
• SPC-CON and SMR-CON: We first concatenate all features together as a new single view, and then run SPC [23] and SMR [13] respectively on it.
• RMKMC: The robust multi-view K-means clustering method obtains the common cluster indicators across multiple views by minimizing a linear combination of the relaxed K-means objectives on each view with learned weight factors.
• PC-SPC: This method enforces the corresponding points in different modalities to have the same cluster membership via a pairwise co-regularization term, which makes the different views agree with each other.
• CC-SPC: This method is similar to PC-SPC, except that it uses a centroid-based co-regularization term, which makes the different views agree with a common centroid view.
• MVSC: This method performs subspace clustering on each individual modality and then unifies them with a common indicator matrix.
• DiMSC: This method learns subspace representations and employs the Hilbert-Schmidt Independence Criterion to enhance complementary information.
For fair comparison, we download the source codes of the compared methods from the authors' websites and follow the experimental settings and parameter tuning steps in their papers to get their best parameters. For RAMSC, we construct 0-1 binary 5-nn graphs W^(v) for each view, and p is fixed to 1 in all experiments. Thus only one parameter, λ, needs to be tuned in our method. We search for the best λ over 1, 2, and 5 to 60 with an incremental step of 5. RAMSC(S_2) denotes that we use Eq (22) to construct S_2, and the best parameter γ is searched from 0.1 to 2 with an incremental step of 0.2. The reported experimental results correspond to the best parameters.
Before we do the clustering work, we first normalize each view of the multi-view data to make all the values in the range [−1, 1]. All the experiments are repeated 50 times independently, and the mean and standard deviation of the results are reported.
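The paper does not spell out the exact normalization; one common per-feature scheme that maps each view into [-1, 1] is max-absolute scaling, sketched below (our assumption, not necessarily the authors' exact procedure):

```python
import numpy as np

def normalize_view(X):
    """Scale each feature (row) of one view into [-1, 1] by its
    maximum absolute value; all-zero features are left untouched."""
    scale = np.abs(X).max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0
    return X / scale
```

Normalizing per view keeps views with very different raw feature ranges comparable before the weights α^(v) are learned.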
Three standard clustering evaluation metrics are utilized to measure the multi-view clustering performance, that is, Clustering Accuracy (ACC), Normalized Mutual Information (NMI) and Purity.

Experimental results
The experimental results on the five datasets with three metrics are shown in Tables 2, 3, 4, 5 and 6. In terms of clustering performance, we have the following observations.
1. From Tables 2, 3, 4, 5 and 6, we conclude that our proposed method outperforms the competing methods on all the benchmark datasets. Although ACC, NMI and Purity are three different evaluation metrics, they all indicate the advantages of our method. The clustering results also show the effectiveness of constructing the similarity matrix by Eq (22): compared with Eq (21), it achieves better or at least comparable performance.
2. From Tables 2, 3 and 4, it can be seen that some individual view features are more discriminative for clustering. As for the comparison between single-view methods and previous multi-view approaches, the previous multi-view clustering methods cannot always achieve better performance. This may be caused by the fact that previous methods characterize the structure of each view separately and combine them by simple addition, which makes the final clustering results affected by inaccurate structures. Our approach performs better than single-view methods in most cases because it assigns small weight factors to inaccurate views and learns a common self-representation matrix Z which can be used to construct a common similarity matrix S across different views.
3. Tables 5 and 6 show the robustness of our method. On the NBA-NASCAR data set, all the competing methods except RMKMC fail to achieve reasonable performance. This is because RMKMC utilizes a weight factor for each view and a sparsity-inducing norm to eliminate the influence of outliers, while the other competing methods consider neither the outliers nor the sparsity of the input data. Compared with RMKMC, our method learns the view weight factors automatically without an additional parameter and uses the $\|\cdot\|_{2,p}^{p}$ norm to eliminate the influence of inaccurate features. Our method performs better on the NBA-NASCAR data set and still achieves good performance on the WebKB data set, where all the other compared methods fail.

Convergence behavior
In order to verify the convergence of Algorithm 1, we present the numerical results of the convergence behavior on datasets MSRC-v1 and Caltech101-7.
The convergence curves are displayed in Fig 1. As shown in Fig 1, the objective values of Eq (6) are non-increasing during the iterations and converge to a fixed value. Additionally, our algorithm converges within 10 iterations, which indicates a fast convergence speed.
As we can see from the results in Figs 2 and 3, the final clustering results of RAMSC and RAMSC(S_2) are affected by different values of λ and by different combinations of λ and γ, respectively. Besides, RAMSC has a different optimal λ on the two data sets, and RAMSC(S_2) has a different optimal combination of λ and γ, because the two data sets have different characteristics.

Conclusion
In this paper, we have proposed a novel robust auto-weighted multi-view subspace clustering model, named RAMSC. This model can naturally assign suitable weights to each view and learn a common representation matrix, which can be used to construct a similarity matrix directly. Moreover, by imposing the structured sparsity norm, our method is robust to inaccurate features. The accompanying proof guarantees that the proposed method converges to a local optimal solution. Experimental results on five data sets show that our proposed method achieves a higher degree of accuracy than the state-of-the-art methods. However, several problems remain for future work:
• A series of related methods needs to be developed and systematically compared. The core idea of our method is to learn view weights automatically and find a high-quality common subspace representation matrix. Based on it, we list three possible ways to develop new related methods. First, the smooth regularized terms of our method can be replaced by others; second, the sparsity norm on the error matrix can be replaced by other reasonable norms; third, our method has no constraint, and constraints on the common representation matrix or the error matrix can be added. According to specific applications, corresponding related methods can be proposed.
• Another open problem lies in the selection of the parameters, especially the balance parameter λ, which is still an unsolved problem in many learning algorithms. In this paper, we determine it empirically. Additional theoretical analysis is needed on this topic.
Supporting information S1 Appendix. RAMSC. A file containing MATLAB code for RAMSC and the normalized datasets used in this paper. (ZIP)